(54) Title: A SYSTEM FOR PREDICTING FUTURE HEALTH
(57) Abstract

A computer-based system is disclosed for predicting future health of individuals comprising:
(a) a computer comprising a processor containing a database of longitudinally-acquired
biomarker values from individual members of a test population, subpopulation D of said
members being identified as having acquired a specified biological condition within a
specified time period or age interval and a subpopulation D̄ being identified as not having
acquired the specified biological condition within the specified time period or age interval;
and (b) a computer program that includes steps for (1) selecting from said biomarkers a subset
of biomarkers for discriminating between members belonging to the subpopulations D and D̄,
wherein the subset of biomarkers is selected based on distributions of the biomarker values of
the individual members of the test population; and (2) using the distributions of the selected
biomarkers to develop a statistical procedure that is capable of being used for (i) classifying
members of the test population as belonging within a subpopulation PD having a prescribed
high probability of acquiring the specified biological condition within the specified time
period or age interval or as belonging within a subpopulation PD̄ having a prescribed low
probability of acquiring the specified biological condition within the specified time period or
age interval; or (ii) estimating quantitatively, for each member of the test population, the
probability of acquiring the specified biological condition within the specified time period or
age interval.

A SYSTEM FOR PREDICTING FUTURE HEALTH

FIELD OF INVENTION

A computer-based system and method are disclosed for predicting the future health of an
individual. More particularly, the present invention predicts the future health of an individual
by obtaining longitudinal data for a large number of biomarkers from a large human
population, statistically selecting predictive biomarkers, and determining and assessing an
appropriate multivariate evaluation function based upon the selected biomarkers.

BACKGROUND OF THE INVENTION

It would be desirable if the onset of future health problems could be predicted for an
individual with sufficient reliability far enough into the future so that the chances could be
increased for preventing future health problems for that individual rather than waiting for
actual onset of a disease and then treating the symptoms. At present, the overwhelming
fraction of medical research funding is directed toward treatment of disease rather than
toward discovering preventive measures that could be directed toward reducing the risk of
disease long before any of the typically observed symptoms of the disease are evident.
Although the emphasis on treatment of diseases has led to enormous advances in the medical
sciences [...] sophistication of the techniques and methods developed for diagnosing existing
diseases as well as for treating the diseases after diagnosis, such advances continue to lead to
ever-increasing costs for treatment. Such costs can have staggering financial consequences for
individuals as well as for the entire society. Such staggering costs have led to increasing
public pressure to find ways of reducing medical costs.



Thus, in addition to the benefit to be gained by an individual who could be informed of the
high risk of the onset of disease far enough in advance so that effective preventive steps could
be taken, substantial reductions in overall medical costs might be realized by entire [...]

Until now, two of the problems inherent in attempting to assess or predict an individual's
future health are: (a) such predictions are imprecise because they are based on data obtained
from relatively small study samples, consisting of a few hundred or even a few thousand
subjects, and (b) the predictions require extrapolation to individual persons from the mean
(and other parameters) of that sample. Such extrapolations are highly problematic with
respect to reliably estimating the risk of a specific individual, even within a group at high risk
for a specific disease. This is true, in part, because the statistical procedures that are typically
used are designed to make inferences about population means, not about individual members
of the population.



To obtain quantitative predictions, an "individual's future health" must be designated as the
occurrence of a specific event within a specified timeframe. Two examples are: (a)
occurrence of a myocardial infarction within the succeeding five years, (b) the individual's
death within the next year. Predictions of such events are necessarily probabilistic in nature.



Two types of probability are important in this context. The a priori probability of an event is
the probability of the event, before the fact of the event's occurrence or non-occurrence. The
post hoc probability of an event is the probability of the event after the event is realized, i.e.,
after the event's occurrence or non-occurrence. Clearly the post hoc probability of an event
is 1 if the event occurred and 0 if the event did not occur. The distinction between the a
priori probability and post hoc probability is worthy of note.



The a priori probability of an event occurring in the subsequent year, or other time interval,
can be important information. Knowledge of the probability of an event can modify behavior
or, put another way, the actions one takes (behavior) can depend on the a priori probability of
an event. This principle is made self evident by considering two extreme cases. One would
almost surely exhibit different behaviors (take different actions) under the two scenarios: one
is informed that one's probability of death in the coming year is (a) 0.9999, or (b) 0.0001.



The a priori probability of an event depends upon the information available at the time the [...]

A living person will be selected at random from all U.S. residents and followed for a period
of one year. At the end of the year the person's vital status (alive or dead) will be ascertained.
The "event" is "the person died during the year." At the end of the year the event either
occurred (person died) or did not occur (person survived) with post hoc probabilities of 1 and
0, respectively. Before the person is selected, the U.S. mortality statistics can be used to
estimate the a priori probability that the person will die in the year. This probability is
computed as p = d/N, where N is the total number of persons in the at risk group (here, all the
persons in the U.S. population who were alive at the beginning of the year) and d is the total
number of deaths among the at risk group. For example, the data from calendar year 1993 are
(approximately) d = 2,268,000, N = 257,932,000, and the a priori probability of the event is
approximately p = 0.0088. [Data from Microsoft Bookshelf 1995 Almanac, article entitled,
"Vital Statistics. Annual Report for the Year 1993 (Provisional Statistics), Deaths." and Vital
Statistics of the United States, published by the National Center for Health Statistics.] In this
game, the a priori probability of the event is based upon very little information, simply that
the person would be a member of the at risk group, consisting of all persons who would be
alive and a U.S. resident at the time of selection.
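
The calculation p = d/N can be reproduced from the two 1993 figures quoted above. The lines
below are only a minimal sketch; the function name a_priori_probability is illustrative and
does not come from the disclosure.

```python
def a_priori_probability(deaths: int, at_risk: int) -> float:
    """Return p = d / N, the proportion of the at-risk group experiencing the event."""
    if at_risk <= 0:
        raise ValueError("the at-risk group must contain at least one person")
    return deaths / at_risk


# Approximate 1993 U.S. figures quoted in the text
d = 2_268_000      # deaths among the at-risk group during the year
N = 257_932_000    # persons alive at the beginning of the year

p = a_priori_probability(d, N)
print(f"p = {p:.4f}")
```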



Additional information about the at risk group, from which the subject is selected at random,
implies additional information about the subject and modification of the a priori probability
of the event. For example, continuing the "game" above based on 1993 data:



If the at risk group were the group of U.S. males, i.e., if the subject is known, prior to
selection, to be a male, the a priori probability of the event is approximately p =
0.0093, which is about 6% higher than the case where gender is unknown or
unspecified.



If the at risk group were the group of U.S. males aged 75-84, i.e., if the subject is
known, prior to selection, to be a male in the age interval 75-84, the a priori
probability of the event is approximately p = 0.0772, or about 8.3 times as high as for
males where age is unknown or unspecified.
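
The two comparisons quoted in these examples follow from simple ratios of the stated
probabilities. A short sketch, using only the three values given in the text (the variable
names are illustrative):

```python
# A priori probabilities quoted in the text (1993 data)
p_all = 0.0088          # any U.S. resident
p_male = 0.0093         # U.S. males
p_male_75_84 = 0.0772   # U.S. males aged 75-84

# Percent increase from knowing the subject is male
excess_male = (p_male / p_all - 1) * 100

# Fold increase from also knowing the subject is aged 75-84
ratio_age = p_male_75_84 / p_male

print(f"{excess_male:.0f}% higher; {ratio_age:.1f} times as high")
```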



WO 98/35609

PCT/US98/02433



These examples illustrate the general principle that the a priori probability of an event
depends upon the information available at the time the probability is evaluated. The most
accurate estimate of an a priori probability is typically the one based on all of the available
information.

A very accurate estimate of an a priori probability does not guarantee a specific outcome: that
is, the a priori probability for a specific individual may not be very close to the post hoc
probability. Consider the extreme case cited above, where the a priori probability of death of
a specific individual in the succeeding year is 0.0001. Although survival is highly probable,
it is not guaranteed: of all individuals in this "game," approximately 9,999 of each 10,000
will survive the year and have a post hoc probability of 0 (which is close to the a priori
probability, 0.0001) and approximately 1 of each 10,000 will die and have a post hoc
probability of 1, which is very different from the a priori probability. To further elucidate
this principle, consider a fair coin toss in which the a priori probability of "heads" is exactly
0.5. The post hoc probability of "heads" is either 0 or 1, neither of which is very close to 0.5.
Thus, the a priori probability for one individual should not be considered an approximation
of the post hoc probability for that individual. However, if a very large number of individuals
"play the game," the mean of the post hoc probabilities, which is also the proportion of
individuals for whom the event occurs, will be very close to the a priori probability.

In some cases a person can change an a priori probability by "moving" to a group with a
different a priori probability. For example, epidemiologists have shown that a U.S. resident,
middle-aged male with a high total cholesterol level, including a high low-density lipoprotein
level, has a higher a priori probability of death from myocardial infarction in the succeeding
five years than a comparable person with a much lower cholesterol level. Clinical trial
research has shown that if the high-cholesterol person can reduce his cholesterol level
substantially, i.e., "move" to a much lower cholesterol "group," he substantially reduces his a
priori probability of death from myocardial infarction in the succeeding five years.

In succeeding paragraphs and sections the word risk will be used in place of the phrase "a
priori probability of a specified event within [...]






statistical definition of risk as expected loss, where the loss function takes value 1 if the event
occurs and 0 if the event does not occur.

The foregoing comments illustrate the principle that differing levels of information lead to
differing a priori probabilities. The risk for a person about whom much is known (i.e., a
member of a small subpopulation with many known characteristics) may be very different
from the risk for a large subpopulation with few known characteristics. However, there is yet
another problem confounding the ability of traditional scientific research studies on
populations to ascertain risk of disease for individuals. This problem results from a
commonly over-simplified understanding of the causation of disease, particularly the
causation of chronic degenerative diseases such as cancers, cardiovascular diseases, diabetes,
etc. That is, there is a tendency to believe, for a variety of reasons, that such diseases can
either be controlled or be clinically indicated by single [...] pharmaceutical compound. For
example, it has been suggested that breast cancer can be controlled by a modest reduction of
fat intake, that colon cancer can be controlled by adding specific dietary fiber components,
that heart disease is clinically indicated by elevated blood cholesterol, and that stomach
cancer can be clinically indicated by low blood levels of vitamin C. These over-simplified
views too often prove to be inadequate for identifying causation, particularly for an individual
person. There are too many confounding variables to be taken into consideration, to say
nothing of the great difficulties of extrapolating population data to individuals within the
population. Testing and investigating single constituents, among a milieu of thousands if not
millions of possible constituent causes, is fraught with great uncertainty, especially when
attempting to extrapolate these data to the estimation of disease risks for individuals.



These dual difficulties, (a) of extrapolating data for experimental populations of individuals
to a randomly selected individual and (b) of relying on single indicators or causes of disease
occurrence, seriously compromise estimation of future disease risk for a randomly selected
individual. If an individual's risk for a specific disease could be determined more reliably, it
then would be possible to provide information to this individual who could then make more
[...] predicting future health could become an unusually powerful means for individuals to
internalize their own health situation and, thus, to take more effective control of their own
well being.

Moreover, for those individuals identified as being at high risk for a particular disease
because they may fall within several categories, wherein each individual category is highly
correlated with a specific disease, such as heart disease, the currently available methodology
typically does not allow one to quantitatively predict when the disease will strike or become
fatal for a specific individual with a sufficient reliability [...] that individual, in general, to
take effective steps far enough in advance to significantly reduce that risk. It would,
therefore, be desirable to have an effective general purpose tool that would not only reliably
predict onset of a specific [...] period, but such a tool would also be useful for monitoring the
preventive measures that are taken based on such predictions.



ADVANTAGES AND SUMMARY OF THE INVENTION

The present invention is directed to providing a tool for assessing an individual subject's risk
of future disease so that greater effort can be made by that individual to the prevention rather
than the treatment of disease.

More specifically, the present invention is directed at providing a general tool for
quantitatively predicting the risk, for a selected individual, of a wide range of diseases with
substantially higher probabilities further into the future and with greater reliability than is
now possible.

In particular, the present invention is directed toward providing a computer-based method and
apparatus that provides an on-going system for assessing future health risks for a specific
individual, and for monitoring the preventive measures taken so as to reduce future health
risks for that specific individual.

The present invention identifies sets of selected biomarkers containing information [...]
probability that an individual will acquire a specified biological condition within a specified
time period or age interval and uses cross-sectional and/or longitudinal values of those
biomarkers to estimate the individual's risk.



Still more particularly, the present invention is directed to a computer-based system for
predicting future health of individuals comprising:

(a) a computer comprising a processor containing a database of longitudinally-acquired
biomarker values from individual members of a test population, subpopulation D of said
members being identified as having acquired a specified biological condition within a
specified time period or age interval and a subpopulation D̄ being identified as not having
acquired the specified biological condition within the specified time period or age interval;
and

(b) a computer program that includes steps for:

(1) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is
selected based on distributions of the biomarker values of the individual members of the test
population; and

(2) using the distributions of the selected biomarkers to develop a statistical procedure that is
capable of being used for:

(i) classifying members of the test population as belonging within a subpopulation PD having
a prescribed high probability of acquiring the specified biological condition within the
specified time period or age interval or as belonging within a subpopulation PD̄ having a
prescribed low probability of acquiring the specified biological condition within the specified
time period or age interval; or

(ii) estimating quantitatively, for each member of the test population, the probability of
acquiring the specified biological condition within the specified time period or age interval.

The present invention is further directed, inter alia, to a computer-based system for predicting
an individual's future health comprising:

(a) a computer comprising a [...] from an individual; and

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as:

(i) to classify said individual as having a prescribed high probability of acquiring a specified
biological condition within a specified time period or age interval or as having a prescribed
low probability of acquiring the specified biological condition within the specified time
period or age interval; or

(ii) to estimate quantitatively, for said individual, the probability of acquiring the specified
biological condition within the specified time period or age interval;

wherein said statistical procedure is based on

(1) collecting a database of longitudinally-acquired biomarker values from individual
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval
and a subpopulation D̄ being identified as not having acquired the specified biological
condition within the specified time period or age interval;

(2) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is
selected based on distributions of the biomarker values of the individual members of the test
population; and

(3) using the distributions of the selected biomarkers to develop said statistical procedure.
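
Steps (1) to (3) can be illustrated with a deliberately simplified, cross-sectional sketch.
Everything in it is an assumption made for illustration: the data are synthetic, the selection
rule (a standardized mean difference above an arbitrary threshold) and the midpoint cutoff are
placeholders, and a Fisher linear discriminant merely stands in for whatever statistical
procedure an actual implementation develops. Longitudinal structure is ignored here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step (1), stand-in: a synthetic database of 400 members and 6 biomarkers
n_d, n_not_d, n_markers = 100, 300, 6
shift = np.array([1.2, 0.0, 0.9, 0.0, 0.0, 0.1])   # only some markers differ
X_d = rng.normal(loc=shift, scale=1.0, size=(n_d, n_markers))       # acquired the condition
X_not_d = rng.normal(loc=0.0, scale=1.0, size=(n_not_d, n_markers)) # did not acquire it


def standardized_difference(a, b):
    """Per-biomarker separation between the two groups' distributions."""
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd


# Step (2): keep the biomarkers whose distributions differ most (threshold is arbitrary)
selected = np.flatnonzero(standardized_difference(X_d, X_not_d) > 0.5)

# Step (3): Fisher linear discriminant on the selected biomarkers
A, B = X_d[:, selected], X_not_d[:, selected]
pooled_cov = np.atleast_2d(
    ((n_d - 1) * np.cov(A, rowvar=False) + (n_not_d - 1) * np.cov(B, rowvar=False))
    / (n_d + n_not_d - 2)
)
w = np.linalg.solve(pooled_cov, A.mean(axis=0) - B.mean(axis=0))
cutoff = w @ (A.mean(axis=0) + B.mean(axis=0)) / 2   # midpoint rule, ignores group sizes


def classify(values):
    """True for the high-probability group, False for the low-probability group."""
    return bool(values[selected] @ w > cutoff)


print(selected, classify(shift), classify(np.zeros(n_markers)))
```

The midpoint cutoff treats both groups as equally likely; a real procedure would choose the
cutoff to match the prescribed high and low probabilities.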



Further objectives and advantages of the present invention will be apparent to those skilled in
the art from the detailed description of the disclosed invention.

BRIEF DESCRIPTION OF THE DRAWING

Fig. 1 shows the empirical distribution functions ("EDF") of the linear discriminant function
values, based on estimation, for Group D (solid line) and Group D̄ (dashed line) of the
Example.

Figure 2 shows the empirical distribution functions ("EDF") of the linear discriminant
function values, based on minimum random subject effect predicted values for Group D and
Group D̄ of the Example.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention will now be described in detail for specific preferred embodiments of
the invention, it being understood that these embodiments are intended as illustrative
examples and the invention is not to be limited thereto.



The present invention is based on the theory that an individual's health is, in general,
influenced by a complex interaction of a wide range of physiological and biochemical
parameters relating to the nutritional, toxicological, genetic, hormonal, viral, infective,
anthropometric, lifestyle and any other states potential [...] and putative pathological states
of that individual. Based on this theory, the present invention is directed towards providing a
practical system for predicting future health using multivariate statistical analysis techniques
that are capable of providing quantitative predictions of one's future health based on
statistically comparing an individual's set of biomarker values with a longitudinally-obtained
database of sets of a large number of individual biomarker values for a large test population.
The term "biomarker" is used herein to refer to any biological indicator that may affect or be
related to diagnosing or predicting an individual's health. The term "longitudinal" is used
herein to refer to the fact that the biomarker values are to be periodically obtained over a
period of time, in particular, on at least two measurement occasions.



The frequency and duration of longitudinal assessments may vary. For example, some
biomarkers may be assessed annually, for periods ranging from as short as 2 years to a period
as long as a total lifetime. Under some circumstances, such as evaluation of newborn
children, biomarkers could be assessed more frequently as, for example, daily, weekly, or
monthly. Longitudinal assessment occasions may be "irregularly timed," i.e., occur at
unequal time intervals. The set of longitudinal assessments for an individual may be
"complete," meaning that data from all scheduled assessments and all scheduled biomarkers
are actually obtained and available, or "incomplete," meaning that the data are not complete
in some manner. An individual's biomarkers may be assessed either cross-sectionally, i.e., at
one point in time, or longitudinally. The present invention is capable of performing the
required statistical analyses of data from individuals that have any or all of the characteristics
noted above, i.e., cross-sectional or longitudinal, regularly or irregularly timed, complete or
incomplete.

The subject system for assessing future health provides a quantitative estimate of the
probability of an individual acquiring a specified biological condition within a specified
period of time. The quantitative probability estimate is calculated using the sequence of
statistical analyses of the present invention. The subject system may typically be used to
provide quantitative predictions of future biological conditions for one, two, three, five, or,
ultimately, even 15-20 or more years into the future. Although the subject system may
typically be used long before symptoms of a particular disease are usually observed or
detected, the subject system may also be used for predicting future health over relatively short
time periods of only a few months or weeks, or even shorter time periods, as well.

While there is no upper limit to the number of members that may be included in the test
population, which might eventually include several million test members, a representative test
population may include far smaller numbers initially. The test population may be selected
from a much larger general population using appropriate statistical sampling techniques for
improving the reliability of the data collected.



In a representative embodiment, the present invention is directed to a computer-based system
that uses a series of statistical analysis steps for creating mathematical-statistical functions
that can be used to estimate an individual's risk of acquiring a specified biological condition
within a specified time period or age interval and to identify individuals that are at highest
risk. Prior to Phase I of the subject method, the available subjects may be randomly assigned
to a Training Sample or an Evaluation Sample; Phases I-III operate on data from the Training
Sample and Phase IV operates on data from the Evaluation Sample. Phase I is a Screening
Phase that uses correlation, logistic regression, mixed model, and other analyses to select a
large subset of biomarkers that have potentially useful information for risk estimation.
Phase II is a Parameter Estimation Phase that uses mixed linear models to estimate expected
value vector and structured covariance matrix of the Candidate Biomarkers, even in the
presence of incomplete data and/or irregularly timed longitudinal data. Phase III is a
Biomarker Selection and Risk Assessment Phase that uses discriminant analysis methodology
and logistic regression to select informative biomarkers (including, where relevant,
longitudinal assessments), to estimate discriminant function coefficients, and to use an
inverse cumulative distribution function and logistic regression to estimate each individual's
risk. Phase IV is an Evaluation Phase that uses the Evaluation Sample to produce unbiased
estimates of the misclassification rates of the discriminant procedure.

Although the individual steps of the statistical procedures noted in the previous paragraph are
described in the statistical literature, it is believed that these individual steps have never been
combined in a single overall procedure as disclosed herein. In particular, classical versions of
the following procedures are described, for example, in the Encyclopedia of Statistical
Sciences, edited by Samuel Kotz, Norman L. Johnson and Campbell B. Read, published by
John Wiley & Sons, 1985, and in additional literature cited therein: (a) correlation analysis
(Volume 2, pp. 193-204), (b) logistic regression analysis (Volume 5, pp. 128-133), (c) mixed
model analysis (Volume 3, pp. 137-141, article "Fixed-, Random-, and Mixed-Effect
Models"), (d) discriminant analysis (Volume 2, pp. 389-397). The present invention can
utilize classical versions of these procedures or such enhancements to and newer versions of
these procedures as may be developed and published from time to time.
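
The random assignment of subjects to a Training Sample and an Evaluation Sample, and the
Phase IV use of the Evaluation Sample to estimate misclassification rates, can be sketched as
follows. This is an illustration under stated assumptions only: the subjects and their scores
are synthetic, the 50/50 split is arbitrary, and a single midpoint cutoff stands in for the whole
of Phases I-III.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic subjects: one discriminant-style score each and a True/False outcome
n = 1000
outcome = rng.random(n) < 0.3                       # True = acquired the condition
score = rng.normal(loc=np.where(outcome, 1.5, 0.0), scale=1.0)

# Random assignment made before Phase I: half Training, half Evaluation
is_training = rng.permutation(n) < n // 2

# Stand-in for Phases I-III: choose a cutoff using Training Sample data only
mean_d = score[is_training & outcome].mean()
mean_not_d = score[is_training & ~outcome].mean()
cutoff = (mean_d + mean_not_d) / 2

# Phase IV: misclassification rates estimated on the untouched Evaluation Sample
evaluation = ~is_training
predicted = score[evaluation] > cutoff
actual = outcome[evaluation]
false_negative_rate = np.mean(~predicted[actual])
false_positive_rate = np.mean(predicted[~actual])

print(f"false negatives: {false_negative_rate:.3f}, false positives: {false_positive_rate:.3f}")
```

Because the cutoff never sees the Evaluation Sample, the two rates are not flattered by having
been tuned on the same subjects they are measured on.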



Correlation analysis is a term for statistical methods used for estimating the strength of the
linear relationship between two or more variables. Correlation, as used here, can include a
variety of types of correlation, including but not limited to: Pearson product-moment
correlations, Spearman's ρ, Kendall's τ, the Fisher-Yates [...], and others.

Logistic regression is a term for statistical methods, including log-linear models, used for the
analysis of a relationship between an observed dependent variable (that may be a proportion,
or a rate) and a set of explanatory variables. The applications of the logistic regression (or
[...] variable is a binary outcome representing an individual's membership in one of two
complementary (non-overlapping) groups of subjects: a group that will acquire a specified
disease or condition (sometimes referred to herein as a "specified biological condition")
within a specified period of time or age interval, and a group that will not acquire the
specified disease or condition within a specified period of time or age interval. In this context
the explanatory variables are typically biomarkers or functions of biomarkers.
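
For the binary-outcome case just described, a logistic regression can be fitted in a few lines.
The sketch below is illustrative only: the data are synthetic, the coefficient values are
invented, and the Newton-Raphson fitting routine is a textbook method chosen for brevity, not
one prescribed by the disclosure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: two biomarkers and a binary outcome (1 = acquires the condition)
n = 2000
biomarkers = rng.normal(size=(n, 2))
design = np.column_stack([np.ones(n), biomarkers])   # intercept column, then biomarkers
true_beta = np.array([-1.0, 1.5, 0.0])               # invented values for the simulation
y = (rng.random(n) < 1 / (1 + np.exp(-design @ true_beta))).astype(float)


def fit_logistic(design, y, iterations=25):
    """Maximum-likelihood fit by Newton-Raphson (iteratively reweighted least squares)."""
    beta = np.zeros(design.shape[1])
    for _ in range(iterations):
        p = 1 / (1 + np.exp(-design @ beta))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


beta_hat = fit_logistic(design, y)

# Estimated probability for a new subject: intercept term, then two biomarker values
new_subject = np.array([1.0, 2.0, 0.0])
risk = float(1 / (1 + np.exp(-new_subject @ beta_hat)))
print(beta_hat.round(2), round(risk, 3))
```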



Mixed model analysis is a term for statistical methods used for the analysis of expected-value
relationships between correlated dependent variables (multivariate measurements or
observations, longitudinal measurements/observations of one variable, and/or longitudinal
multivariate measurements/observations) and "independent variables" that can include
covariates, such as age, classification variables (representing group membership) and also
used for analysis of structures and parameters representing covariances among correlated
measurements/observations. The term "mixed model" includes fixed-effects models,
random-effects models, and mixed-effects models. Mixed models may have linear or
nonlinear structures in the expected-value model and/or in the covariance model. A mixed
model analysis typically includes estimation of expected value parameters (often denoted β)
and covariance matrix parameters (often of the form Σ = ZΔZ' + V, where Δ and V are
matrices of unknown parameters). A mixed model analysis may also include predictors of
random subject effects (often denoted d_k for the k-th subject) and so-called "best linear
unbiased predictors" (or "BLUPs") for individual subjects. A mixed model analysis typically
includes procedures for testing hypotheses about expected value parameters and/or covariance
parameters and for constructing confidence regions for parameters.


In particular, discriminant analysis methodology relates to statistical analysis methods and
techniques for developing discriminant functions that may be used for assigning a
multivariate observation (e.g., a vector of biomarker values from one subject) to one of two
complementary (non-overlapping) groups of subjects [e.g., a group that will acquire a
specified disease or condition within a specified period of time or age interval, and the group
that will not acquire the specified disease or condition within a specified period of time or age
interval]. [...] function that is used as the basis for calculating an estimate of the probability
that a given observation belongs in a given group. For the present invention, the observations
of interest typically comprise a plurality of biomarker values that are obtained from each
member of a large test population or from an individual test subject. The discriminant
functions of the present invention are developed using distributions of these biomarker values
for each biomarker determined to be of interest. Such distributions plot the total number of
individual members of the test population having each biomarker value vs. the biomarker
value itself. Thus, the present invention employs a statistical procedure that uses distributions
based on the individual biomarker values that are obtained for each biomarker from individual
members from the test population, as distinct, for example, from using mean biomarker
values that are obtained from different test populations for the different biomarkers.
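
One common way to compare such distributions between two groups is the empirical
distribution function (EDF), the form named in the description of the drawings. The sketch
below computes EDFs for two synthetic groups; the helper name edf, the group sizes and the
simulated values are assumptions made for illustration.

```python
import numpy as np


def edf(sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, x, side="right") / sample.size


rng = np.random.default_rng(3)
values_d = rng.normal(1.0, 1.0, size=200)       # synthetic: members who acquired the condition
values_not_d = rng.normal(0.0, 1.0, size=600)   # synthetic: members who did not

# Tabulate both EDFs on a common grid; a plot of these columns would resemble
# a two-curve EDF figure with one line per group
grid = np.linspace(-3, 4, 8)
for x, f_d, f_not_d in zip(grid, edf(values_d, grid), edf(values_not_d, grid)):
    print(f"{x:5.1f}  {f_d:.2f}  {f_not_d:.2f}")
```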



The term "discriminant function" is intended to mean any one of several different types of
functions or procedures for classifying an observation (scalar or vector) into two or more
groups, including, but not limited to, linear discriminant functions, quadratic discriminant
functions, nonlinear discriminant functions, and various types of so-called optimal
discriminant procedures.



The computer-based system of the present invention includes a computer comprised of a
processor that is capable of running a computer program or set of computer programs
(hereinafter referred to simply as "the computer program") comprising the steps for
performing the required computations and data processing in the various steps and phases of
the present invention. The processor may be a microprocessor, a personal computer, a
mainframe computer, or in general, any digital computer that is capable of running computer
programs that can perform the required computations and data processing. The processor
typically includes a central processing unit, a random access memory (RAM), read-only
memory (ROM), one or more buses or channels for transfer of data among its various
components, one or more display devices (such as a "monitor"), one or more input-output
devices (such as floppy disk drives, fixed disk drives, printers, etc.), and adapters for
controlling input-output devices and/or display devices and/or connecting such devices to the
[...] of these components.



The computer program may be stored in ROM or on a disk or set of disks, or in any other
tangible medium that may be used for storing and distributing computer programs.



The computer program is capable of performing the computations for the various phases and
steps of the analysis on cross-sectional and/or longitudinal multivariate biomarker data.



The biomarker data are preferably collected from a test population that is sufficiently large
that the total number of members acquiring a specified biological condition of interest within
a two to three year period is large enough for discriminant analysis methodology to be
meaningfully employed for that specified biological condition. Since one of the features of
the present invention is directed toward providing a means for using the same database to
make predictions relating to acquiring any of the major diseases and/or dying from any of the
major underlying causes of death within as few as one to two years, the test population is
preferably large enough to be useful for applying the subject system to any one of the more
common diseases and underlying causes of death that account in the aggregate for at least
about 60%, and more preferably, at least about 75%, of all deaths of interest, wherein the
deaths of interest are herein defined as those of a pathological nature, as distinct from those
caused by accident, homicide or suicide.



For example, using data from the Centers for Disease Control and Prevention (Monthly Vital
Statistics Report, Supplement, Vol. 44, No. 7, Feb 2[...], 1996), it can be shown that more than
75% of all pathologically derived deaths can be accounted for by the following underlying
causes of death: malignant neoplasms (ICD 140-208) having a crude death rate, that is, as
distinguished from an age-adjusted death rate, of 205.6/100,000; major cardiovascular
diseases (ICD 390-448), 367.8/100,000; chronic obstructive pulmonary diseases (ICD 490-496),
39.2/100,000; and diabetes mellitus (ICD 250), 20.9/100,000; as compared with a total
crude death rate of about 880/100,000 for pathologically derived deaths. These diseases are
the ones which, in fact, have been shown to exhibit major dietary and lifestyle effects, to be
responsive to altered dietary and lifestyle conditions [...]

As one of the unique features of the present invention, the subject computer-based system and
apparatus may be used to determine the risk of a specified individual acquiring any one of
these major diseases based on comparing that individual's profile of biomarker values with
the biomarker values obtained from members of a large test population. Since it is known
that these major diseases share many common factors that may be reflected in the biomarker
values, the subject computer-based system may be used to concurrently assess the risk of
acquiring any of these major diseases. For example, it is known that total serum cholesterol
is a biomarker that is related to many of these diseases. By monitoring each profile of
biomarker values that is a significant predictor, in combination with other significant
biomarker predictors, of a specific disease or underlying [...]
invention to compare that profile with the test populations, an individual subject may be
informed, with specified quantitative reliability, which disease poses the greatest risk for that
specific individual.



A particular feature of the present invention is that those individuals who are at greatest risk
of acquiring a specified disease may be provided with a quantitative probability of acquiring
that disease within a specified time period or age interval in the future well before any of the
typical symptoms of that disease are manifest. Armed with that information, for the many
diseases known to be responsive to altered dietary and lifestyle conditions, that individual
may then make those behavioral changes that can reduce the risk of the disease identified.

Furthermore, as more and more data are acquired for larger and larger numbers of subjects
over longer and longer periods of time, more and more refined divisions of each of the major
diseases and causes of deaths as well as of the less common diseases and underlying causes of
death can be defined and included in the methodology of the present invention. For example,
a breakdown can be made in terms of the different types of cancer, e.g., liver cancer, lung
cancer, stomach cancer, prostate cancer, etc. The present computer-based system, thus,
provides a means for including ever larger fractions of the population, so as to predict the
[...] derived disease within a specified time, wherein the diseases are defined with continuously
narrower specificity.

The comprehensive set of biomarkers for which biomarker data are collected from the test
population preferably includes as many as possible of the diverse biomarkers known or
believed to be related to the most common diseases and underlying causes of pathologically
derived deaths. In addition, representative clusters of biomarker values from each of the
known and generally accepted genetic, physiological and biochemical domains of biological
function may be included. Additional biomarkers that are preferably included are, for
example, all those that can be measured in biological samples that may be stored for analysis
long after the sample is collected.



The biological samples preferably include a blood and a urine sample, but still other
biological samples may be included in the samples that are collected. For example, samples
of saliva, hair, toenails and fingernails, feces, expired air, etc. may also be collected. Such
biological samples are typically obtained from substantially every member of the test
population. However, in some situations, specific subsets of biomarkers may be obtained
only from specific subsets of the population.

Concurrent with collecting the biological samples, biomarker data relating to nutritional
habits and lifestyles are also typically obtained from each member of the test population.
Biomarkers relating to nutritional habits and life styles may include, for example, those
shown in Table 1. While the nutritional- and life-style-biomarkers listed in Table 1 are
intended to be illustrative of the types of biomarkers relating to nutritional habits and life
styles, it is to be understood this list is not exhaustive of the nutritional and life style
biomarkers that fall within the scope of the present invention. The biomarkers that exhibit
significant nutritional determinism, as well as the clinical and infections biomarkers, may
also be determined by other factors, such as by nutritional intake. The delineation of
categories, (e.g. serum biomarkers, urine biomarkers, questionnaire, etc.), shown in Table 9
is, thus, only an illustrative division of the categories that may be selected to obtain the
[...] preferably collected and recorded for each member of the test population each time a
biological sample is taken.



TABLE 1. An illustrative list of biomarkers that may be used in the subject method for
predicting future health.



SERUM BIOMARKERS
Total cholesterol*
HDL cholesterol*
LDL cholesterol*
Apolipoprotein B*
Apolipoprotein A1*
Triglycerides*
Lipid peroxide (Malondialdehyde equivalency: TBA)*
α-Carotene (corrected for lipoprotein carrier)*
β-Carotene (corrected for lipoprotein carrier)*
γ-Carotene (corrected for lipoprotein carrier)*
zeta-Carotene (corrected for lipoprotein carrier)*
α-Cryptoxanthin (corrected for lipoprotein carrier)*
β-Cryptoxanthin (corrected for lipoprotein carrier)*
Canthaxanthin (corrected for lipoprotein carrier)*
Lycopene (corrected for lipoprotein carrier)*
Lutein (corrected for lipoprotein carrier)*
anhydro-Lutein (corrected for lipoprotein carrier)*
Neurosporene (corrected for lipoprotein carrier)*
Phytofluene (corrected for lipoprotein carrier)*
Phytoene (corrected for lipoprotein carrier)*
α-Tocopherol (corrected for lipoprotein carrier)*
γ-Tocopherol (corrected for lipoprotein carrier)*
Retinol binding protein*
Ascorbic acid*
Fe*
K*
Mg*
Total phosphorus*
Inorganic phosphorus*
Se*
Zn*
Ferritin*
Total iron binding capacity*
Fasting glucose*
Urea nitrogen*
Uric acid*
Prealbumin*
Albumin*
Total protein*
Bilirubin*
Thyroid stimulating hormone T3*
Thyroid stimulating hormone T4*
Cotinine
Aflatoxin-albumin adducts
Hepatitis B anti-core antibody (HbcAb)
Hepatitis B surface antigen (HbsAg+)
Candida albicans antibodies
Epstein-Barr virus antibodies
Type 2 Herpes Simplex antibodies
Human Papilloma virus antibodies
Helicobacter pylori antibodies
Estradiol (E2) (adjusted for female cycle)*
Sex hormone binding globulin*
Prolactin (adjusted for female cycle)*
Testosterone (adjusted for female cycle for women)*
Hemoglobin*
Myristic acid (14:0)*
Palmitic acid (16:0)*
Stearic acid (18:0)*
Behenic acid (22:0)*
Tetracosaenoic acid (24:0)*
Myristoleic acid (14:1)*
Palmitoleic acid (16:1)*
Oleic acid (18:1n9)*
Gadoleic acid (20:1)*
Erucic acid (22:1n9)*
Tetracosaenoic acid (24:1)*
Linoleic (18:2n6)*
Linolenic acid (18:3n3)*
γ-Linolenic (18:3n6)*
Eicosadienoic acid (20:2n6)*
Di-homo-γ-linolenic acid (20:3n6)*
Arachidonic acid (20:4n6)*
Eicosapentaenoic acid (20:5n3)*
Docosatetraenoic acid (22:4n6)*
Docosapentaenoic acid (22:5n3)*
Docosahexaenoic acid (22:6n3)*
Total saturated fatty acids (16:0, 18:0, 20:0, 22:0, 24:0)*
Total monounsaturated fatty acids (14:1, 16:1, 18:1n9, 20:1, 24:1)*
Total n3 polyunsaturated fatty acids (18:3n3, 20:5n3, 22:5n3, 22:6n3)*
Total n6 polyunsaturated fatty acids (18:3n6, 20:2n6, 20:3n6, 20:4n6, 22:4n6)*
Total n3 polyunsaturated/total n6 polyunsaturated fatty acids (18:3n3, 20:5n3, 22:5n3,
22:6n3/18:3n6, 20:2n6, 20:3n6, 20:4n6, 22:4n6)*
Total polyunsaturated fatty acids (18:2n6, 18:3n3, 18:3n6, 20:2n6, 20:3n6, 20:4n6, 20:5n3,
22:4n6, 22:5n3, 22:6n3)*
Total polyunsaturated/saturated fatty acids (18:2n6, 18:3n3, 18:3n6, 20:2n6, 20:3n6, 20:4n6,
20:5n3, 22:4n6, 22:5n3, 22:6n3/16:0, 18:0, 20:0, 22:0, 24:0)*
[About 10-30 genetic markers, depending on diseases being investigated]

URINE BIOMARKERS
Orotidine
Cl*
Mg*
Na*
Creatinine
AF N7 guanine
AF P1
AF Q1
Aflatoxicol
8-deoxy guanosine

FOOD DERIVED NUTRIENT INTAKES (FROM QUESTIONNAIRE)
Total protein*
Animal protein*
Plant protein*
Fish protein*
Lipid*
'Soluble' carbohydrate*
Total dietary fiber*
Total calories*
Percentage of caloric intake from lipids*
Cholesterol*
Ca
P*
Fe*
K*
Mg
Mn
Na*
Se*
Zn*
Total tocopherols (corrected for lipid intake)*
Total retinoid*
Total carotenoid*
Thiamine*
Riboflavin*
Niacin*
Vitamin C*
[About 30 different types of foods]*
[About 30 different fatty acids]*

RED BLOOD CELLS 
RBC glutathione reductase*
RBC catalase*
RBC superoxide dismutase*

ANTHROPOMETRIC PARAMETERS
Height*

The biological samples are analyzed to determine the biomarker value for each component in
the biological sample for which a biomarker value is desired. It is to be understood that any
component that may be found and measured in a biological sample falls within the scope of
the present invention. For example, genetic biomarkers which may be measured in a blood
sample, as well as the biomarkers that can be measured in any other appropriate biological
sample, may also be included.

Since another feature of the present invention is that of identifying new sets of biomarkers
useful for predicting disease and death, the biomarker sets may include biomarkers not
previously known to have statistical significance for predicting a specific disease or specific
cause of death. Thus, since the total number of biomarkers that may be used is substantially
unlimited in principle, the actual number of biomarkers used may, in general, be limited only
by practical economic and methodological considerations.



Since still another feature of the present invention is that of providing a computer-based
system for predicting specified biological conditions within a specific time period or age
interval in the future, the total number of biomarker values may be limited to only those
biomarker values which have statistical significance for predicting a single specified
biological condition. Thus, while it is intended that the subject system is typically used as a
general purpose tool for predicting and monitoring most, and, eventually, substantially all
major types of diseases and underlying causes of death, use of the methodology disclosed
herein may also be directed to one disease or cause of death at a time.

After being collected, the biological samples may be analyzed immediately or the samples
may be stored for later analysis. Since it is expected that a large number of samples may be
collected in a relatively short period of time and under circumstances not conducive to
immediate on-site analysis, the samples are preferably stored for later analysis. Because the
samples may typically be stored for a substantial period of time, the samples are typically
frozen. The samples are to be stored and transported under conditions that preserve the
integrity of the samples. Such techniques are described, for example, in Chen, J., Campbell,
[...] Characteristics of 65 Chinese Counties. Oxford, U.K.; Ithaca, NY; Beijing, PRC: Oxford
University Press; Cornell University Press; Peoples Medical Publishing House, 1990.



Use of physical specimens such as biological samples is particularly preferred since such
samples provide a practical means of providing a rich source of longitudinally-obtained
biomarker data that can be collected, stored and analyzed using established, cost-effective
techniques. The biological samples are preferably collected for the test population over an
extended period of time of at least 5-10 years, and most preferably, for 15-20 years or more,
such that the quality of the data generated will continuously provide more and more reliable
probability predictions.

Since the reliability of the subject system is ultimately determined by the quality of the
biomarker data collected, appropriate measures are to be taken to assure integrity of the data
from all aspects. For example, concerning biomarker stability, it is necessary to consider and
take appropriate measures to account for the many factors which may influence or cause
deterioration of the biomarker values over time.

Furthermore, while the subject disclosure is typically directed toward obtaining biomarker
data from physical specimens that are obtained from members of a test population or a test
subject, as well as the biomarker data derived from dietary and lifestyle surveys of each test
individual, use of biomarker data obtained from any source falls fully within the spirit and
scope of the present invention. For example, the subject methodology may further comprise
use of medical diagnostic data obtained from electrophysiological measurement techniques
such as electroencephalographic (EEG) data, electrocardiographic (ECG) data, radiologic
(X-ray) data, magnetic resonance imaging (MRI), etc., either alone or, most preferably, in
combination with the longitudinally-obtained biomarker data from biological samples and
dietary and lifestyle surveys.

Since the test population is preferably monitored over a period of years, it is to be expected
that a mortality rate will be observed for the test population that is representative of the
[...] identified and the underlying cause of death is recorded, preferably using a known coding
system, for example, the established International Statistical Classification of Diseases and
Related Health Problems (ICD-10), Geneva, World Health Organization, 1992-c1994, 10th
revision. Other coding systems may also be used while remaining within the scope and spirit
of the present invention.



Using an effective system to identify when a member of the test population acquires a disease
or specified biological condition, morbidity data is also collected, in addition to collecting the
biomarker and mortality data of the test population.

The database of biomarker values preferably includes information from each individual
recording the dates and ages at the times the biomarkers and biomarker samples are collected
and recorded, accurate information from the surveillance of the individual recording each
incident of disease, medical condition, medical pathology [...] date of incident. The database
includes values of biomarkers assessed before, during, and after each incident, where feasible.
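
One possible way to organize such a database record is sketched below. This is only an assumed layout for illustration: the class names, field names and the example biomarker name are hypothetical, and the disclosure does not prescribe any particular schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class BiomarkerMeasurement:
    biomarker: str            # hypothetical name, e.g. "total_cholesterol"
    value: Optional[float]    # None when the value could not be obtained
    collected_on: date        # date the sample or survey was collected
    age_at_collection: float  # subject's age in years at collection


@dataclass
class Incident:
    description: str          # disease, medical condition, or cause of death
    occurred_on: date         # date of the incident
    icd_code: Optional[str] = None  # coding-system entry, if one was recorded


@dataclass
class SubjectRecord:
    subject_id: str
    measurements: List[BiomarkerMeasurement] = field(default_factory=list)
    incidents: List[Incident] = field(default_factory=list)

    def measurements_before(self, incident: Incident) -> List[BiomarkerMeasurement]:
        """Return the measurements collected before the given incident."""
        return [m for m in self.measurements if m.collected_on < incident.occurred_on]


# Hypothetical subject with two collection dates and one recorded incident.
record = SubjectRecord("subject-001")
record.measurements.append(
    BiomarkerMeasurement("total_cholesterol", 5.2, date(1990, 3, 1), 52.0))
record.measurements.append(
    BiomarkerMeasurement("total_cholesterol", 5.9, date(1995, 3, 1), 57.0))
stroke = Incident("ischemic stroke", date(1993, 6, 15))
record.incidents.append(stroke)
earlier = record.measurements_before(stroke)
```

Keeping the collection date and age on every measurement is what allows values to be grouped as before, during, or after an incident.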



Since one aspect of the present invention relates to identifying biomarkers not yet known to 
be statistically significant for predicting future onset of a specified disease or underlying 
cause of death, as many biomarkers as possible are monitored. In a representative 
embodiment, about 200 biomarker values are obtained from each member of the test 
population, although there is substantially no upper limit to the number of biomarkers that
may be used to develop the computer-based statistical analysis methodology. 



Since the present invention is directed toward providing a practical and reliable system for
predicting a specified biological condition within a specified period of time or age interval, a
substantially complete set of biomarker values is collected from each member of the test
population at least two different times. More preferably, so as to obtain information on trends
or changes with time, a full set is collected at least three times and, most preferably, the
biomarker values are collected at periodic intervals for as long as practically feasible.

In still another aspect of the subject invention, which is based on the theory that ratios of a
person's individual biomarker values, or changes in the ratios, may be more important for
predicting future health than the actual level of any given biomarker value, the discriminant
function is typically determined using substantially complete sets of biomarker values. Since
it is recognized that for practical reasons totally complete sets of biomarker values cannot
reasonably be expected to be obtained from every member of the test population on every
testing occasion, the statistical analysis methodology of this invention includes methods that
reliably account for incomplete data in a statistically valid manner.
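
A minimal sketch of these two ideas follows, assuming one subject measured at three visits. The function names and numbers are hypothetical. The column-mean fill shown here is only a naive stand-in used for illustration; it is not the statistically valid missing-data method that the disclosure refers to.

```python
import numpy as np


def ratio_series(numerator, denominator):
    """Ratio of two biomarkers at each visit; a missing (NaN) input stays NaN."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    return numerator / denominator


def ratio_change(ratios):
    """Change in the ratio between the first and last non-missing visits."""
    ratios = np.asarray(ratios, dtype=float)
    observed = ratios[~np.isnan(ratios)]
    if observed.size < 2:
        return float("nan")  # a change needs at least two observed visits
    return float(observed[-1] - observed[0])


def fill_missing_with_column_mean(matrix):
    """Naive stand-in: replace each NaN with the mean of its biomarker column."""
    matrix = np.asarray(matrix, dtype=float)
    column_means = np.nanmean(matrix, axis=0)
    return np.where(np.isnan(matrix), column_means, matrix)


# Hypothetical values at three visits; the second cholesterol value is missing.
total_cholesterol = [200.0, float("nan"), 220.0]
hdl_cholesterol = [50.0, 55.0, 44.0]
ratios = ratio_series(total_cholesterol, hdl_cholesterol)
change = ratio_change(ratios)
```

In this example the ratio can still be tracked across visits even though one measurement is absent, which is the practical reason incomplete records are worth keeping rather than discarding.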

A further object of the present invention is not only to provide a means of quantitatively 
assessing the risk of future specific diseases, but also to provide a practical tool for defining 
and identifying those biological conditions wherein one has the lowest risk of all future 
diseases. The term "specified biological condition" is, [...]
invention, meant to include all ranges of health, from the most robustly healthy to the most
severely diseased. The present invention is, thus, directed towards providing a system for 
monitoring and predicting future health for the most healthy to the least healthy. 

Although the results obtained from the test population may be used for predicting the future
health of general populations in particular countries, it is not necessary to select the test
population from the same general population for which individual future health predictions
will be made. Such a limitation is not necessary since it is known that populations of
individuals who possess probabilities of disease which are characteristic of their home
countries, and who then move to new countries whose populations possess probabilities of
different sets of diseases, will acquire those diseases which are characteristic of the countries
to which they move. This occurs during a time coincident with and following their
acquisition of the diet and lifestyle conditions of the new country. That is, all races and
ethnic groups of the world tend to acquire the same general diseases regardless of their
inherited characteristics, which may be unique to each race or ethnic group.



One of the specific features of the present invention is that a system is provided for predicting
[...] diagnosed. The time of future onset of the specific health problem occurring for a specific
individual can be predicted with a specified quantitative probability estimate based on
applying the subject discriminant analysis methodology to the database collected from the
large test population. Furthermore, the present invention provides a system for predicting
specific health problems further and further into the future with greater and greater reliability
as more and more data are collected for ever larger test populations for longer and longer
periods of time.



The biological samples are typically analyzed for each biomarker for which quantitative
values are desired. For cost and convenience reasons and because of the large number of
samples that may be collected, the samples may be analyzed initially only for those
individuals already diagnosed with a disease or who die during the time period over which
the samples have been collected, as well as for a randomly selected fraction of the remainder
of the test population. For example, if the annual mortality rate for the test population
surveyed is typically in the range of 2-3% annually, a 300,000 member test population would
produce an annual mortality rate of 6,000-9,000 deaths, wherein a significant number of
deaths would have been caused by each of the major underlying causes of death.

One of the further features of the present invention comprises the step of waiting until a
substantial number of deaths have occurred in the test population and then selecting those
individuals as the ones for whom the biomarker values are to be determined initially. In
addition, a group of still living test members may then be selected from the remainder of the
test population. Because of the need to balance the need for large enough numbers of
samples to obtain statistically significant results with the need to control costs, the subject
system provides a practical method of limiting the analytical measurement costs to only those
samples that will tend to provide the most information for the least cost. Naturally, as more
and more deaths occur in the test population, larger and larger numbers of samples will be
analyzed over time. However, the value of the data obtained, from the point of view of
establishing more and more reliable quantitative predictions of future health, will be more or
less commensurate with the cost of acquiring the additional biomarker values. This is another
of the many special features of the present invention that distinguishes it from any known
prior art system. This technique of postponing sample analysis permits postponement of cost
until the results obtained tend to have greater practical value.

Upon selecting the samples to be analyzed, the biomarker values may be determined using
well known methodologies. Since a large number of samples are to be analyzed with each
being measured for a large number of biomarker values, many, if not most, of these
measurements are typically made using a multi-channel analyzer, for example, the
BMD/Hitachi Model 747-100 such as manufactured by the Boehringer Mannheim Corp. of
Indianapolis, IN. Such analyzers can be designed to measure the biomarker values of
selected large sets of biomarkers simultaneously using relatively small quantities of the total
sample. For example, the quantity of blood collected is typically about 15 ml, whereas only
about 10-30 µl may be required per analytical measurement. Similarly, the quantity of urine
collected is typically about 50 ml, whereas a quantity of about 100 µl is required for the
analysis. Appropriately small quantities of other biological samples may also be used.
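
The staged selection of samples described above (members who were diagnosed or who died first, plus a randomly selected fraction of the remainder) and the mortality arithmetic can be sketched as follows. The function names, the member identifiers and the 20% control fraction are assumptions chosen for illustration; the disclosure does not specify a sampling fraction.

```python
import random


def expected_annual_deaths(population_size, annual_mortality_rate):
    """Expected number of deaths per year for a given mortality rate."""
    return population_size * annual_mortality_rate


def select_for_analysis(member_ids, diagnosed_or_deceased, control_fraction, seed=0):
    """Pick the samples to analyze first.

    Returns (cases, controls): every member in diagnosed_or_deceased, plus a
    random fraction of the remaining members.
    """
    cases = [m for m in member_ids if m in diagnosed_or_deceased]
    remainder = [m for m in member_ids if m not in diagnosed_or_deceased]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_controls = round(len(remainder) * control_fraction)
    controls = rng.sample(remainder, n_controls)
    return cases, controls


# A 300,000 member population at 2-3% annual mortality: about 6,000 to 9,000 deaths.
low = expected_annual_deaths(300_000, 0.02)
high = expected_annual_deaths(300_000, 0.03)

# Hypothetical small example: 100 members, 5 of whom are diagnosed or deceased.
members = list(range(100))
cases, controls = select_for_analysis(members, set(range(5)), control_fraction=0.2)
```

Samples of members who are not selected remain in storage, so they can still be analyzed later as more deaths or diagnoses occur.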



Since, in the representative embodiments, physically-preservable biological samples may be
used, and since only relatively small analytical sample quantities may be used for taking
measurements at any arbitrarily selected time, typically long after the sample has been
collected, the subject methodology may be effectively applied using any biomarker that is
detectable within a given sample. For example, although the system may be used initially to
analyze what are currently deemed to be the more significant biomarkers, the system may be
readily adapted to include other biomarkers that are not yet recognized to have significance
for predicting future health. In principle, with adequate time and economic resources, every
biomarker that is detectable in the preserved biological samples may ultimately be measured.



Although it may be desirable to acquire substantially complete sets of biomarker values for
each member of the test population, this is typically very difficult to realize especially if the
samples are to be longitudinally collected from a wide, geographically dispersed population
base. Using conventional statistical analysis methodology, in which an incomplete set of data
is typically discarded and not used at all, substantial quantities of data ultimately covering a
large fraction of the initial test population would need to be [...]
substantial waste of resources and severe degradation of the quality of the results generated
by the remaining data. The subject computer-based methodology includes a feature that
provides a means of using substantially all data collected, by using a statistically verifiable
technique for filling in the "missing values." This is a particularly useful aspect of the subject
methodology, which is based on collecting what amounts to huge quantities of data, as
compared with any prior art studies, for very large numbers of test members from a test
population that is widely dispersed geographically. Acquisition of comprehensive data from
a diverse large test population is particularly desirable so as to obtain biomarker values from
members having widely divergent dietary and lifestyle practice representative of the entire
human experience.

For the purpose of describing the present invention, the following terminology is explained
herein:

A "specified biological condition" may, for example, refer to any one of the following:

• a specified disease, for example, as classified in International Statistical Classification
of Diseases and Related Health Problems, supra, (e.g., diabetes mellitus);

• a specified medical or health condition or syndrome (e.g., hypertension, as generally
defined by deviations of biomarker or biomarker set values from the usual normal
distributions);

• a specified medical event and its sequelae (e.g., ischemic stroke and subsequent death,
or non-death and stroke-related partial paralysis and related conditions; myocardial
infarction and subsequent death, or non-death and MI-related conditions);

• premature death from any cause (premature death at an age earlier than the mean age
at death as projected from the person's gender and age at first evaluation);

• death at a specified age [...]

• a newly defined category based on having or acquiring a specified set of biomarker
values for a specified set of biomarkers.



Acquisition or onset of the specified biological condition refers to the situation wherein a
person does not have the specified biological condition at the time of a given evaluation, but
who subsequently experiences the specified biological condition, in which case the person is
said to have acquired the specified biological condition with onset being defined as occurring
when the person acquired that specified biological condition.



For a specified biological condition and for a population of persons who do not have, or have
not had, the specified biological condition, there are two complementary subpopulations,
identified as Group D and Group D̄, and described as follows:

• Group D: That subpopulation of persons who will acquire the specified biological
condition within a specified timeframe. As used here, specified timeframe can refer to
a specified interval of calendar time (e.g., "the next five years"), to a specified age
interval (e.g., "between 65 and 70 years of age"), or to a similar specific time or age
interval.

• Group D̄: That subpopulation of persons who will not acquire the specified biological
condition within the specified timeframe.
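
For members of the test population whose outcomes are already known, the two definitions above amount to a simple labeling rule, sketched below for both a calendar-time timeframe and an age interval. The function names are hypothetical, and the sketch assumes every subject was followed for the whole timeframe; subjects lost to follow-up would need separate handling that is not shown here.

```python
from typing import Optional


def assign_group(years_until_onset: Optional[float], timeframe_years: float) -> str:
    """Label a subject by calendar-time timeframe (e.g. "the next five years").

    years_until_onset is the time from evaluation to onset, or None if the
    subject never acquired the condition during follow-up.
    """
    if years_until_onset is not None and 0 < years_until_onset <= timeframe_years:
        return "D"
    return "not D"


def assign_group_by_age(onset_age: Optional[float], age_start: float, age_end: float) -> str:
    """Label a subject by age interval (e.g. "between 65 and 70 years of age")."""
    if onset_age is not None and age_start <= onset_age <= age_end:
        return "D"
    return "not D"


# Hypothetical subjects evaluated against a five-year timeframe.
labels = [assign_group(t, 5.0) for t in (3.2, None, 7.0)]
```

The two groups are complementary by construction: every subject receives exactly one of the two labels.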

These subpopulations of subjects are partially characterized by a specific longitudinal pattern
of data on a (possibly large) number of biomarkers. A longitudinal pattern includes not only
the level or tissue concentration of a biomarker, but also changes in the level. If one knows
which longitudinal patterns of biomarkers partially characterize the subpopulations, and has
the necessary data from a specific person, that person can be classified into one of two
complementary groups, based upon whether the person is projected to belong to Group D or
to Group D̄:

• Group PD: That group of persons who, at the beginning of the specified timeframe,
are predicted to acquire the specified biological condition within the specified
timeframe, i.e., projected to belong to Group D. These persons are described as
having a prescribed high probability of acquiring the specified biological condition
within the specified timeframe.

• Group PD̄: That group of persons who, at the beginning of the specified timeframe,
are predicted not to acquire the specified biological condition within the specified
timeframe, i.e., projected to belong to Group D̄. These persons are described as
having a prescribed low probability of acquiring the specified biological condition
within a specified timeframe.



The term "prescribed high probability" may vary in magnitude from having a probability as
low as a few percent, perhaps even as low as 1% or less, or may be as high as 10%, 20%,
50%, or even substantially higher, depending on the specified biological condition. For
example, the increased risk of acquiring lung cancer due to smoking may be perceived by
many as a significant and preferably avoidable risk, even though the actual several-fold
increase in risk that is caused by smoking may only be in the range of a 5-10% probability for
acquiring lung cancer as far as 15-20 years or more into the future. In any case, for each
specified biological condition for which the system is applied, a quantifiably prescribed
probability may be determined. The "prescribed low probability" may be specified simply as
the probability of not being in the high risk group for acquiring the specified biological
condition or, alternatively, the term may be separately specified as a concrete value.
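As a minimal sketch of how a prescribed probability could be applied, the following assigns each subject to Group PD or Group PD̄ from an already-estimated probability. The function name, the group labels, and the 10% default are illustrative assumptions; the text does not fix a particular value or interface.

```python
def classify(prob_by_subject, prescribed_high=0.10):
    """Assign subjects to 'PD' or 'PD-bar' from estimated probabilities.

    prob_by_subject: dict of subject id -> estimated probability of acquiring
    the condition within the specified timeframe (hypothetical input).
    prescribed_high: the prescribed high probability; 0.10 is only an example.
    """
    groups = {}
    for subject_id, p in prob_by_subject.items():
        # At or above the prescribed value: projected to belong to Group D.
        groups[subject_id] = "PD" if p >= prescribed_high else "PD-bar"
    return groups


if __name__ == "__main__":
    print(classify({"s1": 0.02, "s2": 0.10, "s3": 0.35}))
```

Here the prescribed low probability is taken simply as "not in the high risk group", which is the first of the two options the paragraph describes.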



At the point when a statistically adequate number of the members of the test population can
be identified as belonging to Group D or Group D̄, the biomarker values of the members of
Group D may be compared with members of Group D̄ using the subject methodology, so as
to determine a statistical procedure for classifying members into Groups PD and PD̄ or for
estimating the probability, for each member of the test population, of acquiring the specified
biological condition within the specified time period or age interval, i.e., the probability of
belonging to Group PD or the probability of belonging to Group PD̄. In a representative
embodiment of the subject invention, the statistical procedure for classifying members into
Groups PD and PD̄ will be a form of a discriminant analysis procedure as described below;
the procedure may be referred to as a "discriminant procedure" or "discrimination
procedure." A "statistically adequate number" may be defined as one for which the total
number of biomarkers used in the analysis and the total number of test members for which the
biomarker values are available are each large enough such that convergence is achieved for
the computational procedures used in the subject methodology.

A discrimination procedure has two relevant error rates:

(1) Proportion of false positives, i.e., the proportion of future subjects who will be
classified in Group PD but who actually belong to Group D̄.

(2) Proportion of false negatives, i.e., the proportion of future subjects who will be
classified in Group PD̄ but who actually belong to Group D.

A representative embodiment of the subject invention will incorporate methodology for
obtaining accurate estimates of these two error rates.
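The two error rates can be illustrated with a short sketch. It assumes each proportion is taken over all classified subjects (the text could also be read as a proportion within each predicted group), and the labels "PD", "PD-bar", "D", and "D-bar" are illustrative names only.

```python
def error_rates(predicted, actual):
    """Return (false_positive_proportion, false_negative_proportion).

    predicted: list of 'PD' / 'PD-bar' classifications.
    actual:    list of 'D' / 'D-bar' outcomes observed later.
    Both proportions use the total number of subjects as denominator
    (an assumption made for this sketch).
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    n = len(predicted)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p == "PD" and a == "D-bar")
    false_neg = sum(1 for p, a in zip(predicted, actual) if p == "PD-bar" and a == "D")
    return false_pos / n, false_neg / n


if __name__ == "__main__":
    print(error_rates(["PD", "PD", "PD-bar", "PD-bar"], ["D", "D-bar", "D", "D-bar"]))
```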



A representative embodiment of the subject invention consists of three phases, each with
multiple steps. The three phases are:

Phase I. Establish Evaluation Methodology and Select Biomarkers for Consideration.
Phase II. Reduce the Candidate Biomarkers to a Set of Select Biomarkers that have
Discriminatory Power and Perform Mixed Model Estimation of the
Covariance Structure and Predicted Values.
Phase III: Calculate the Discriminant Functions Using Estimated Means and
Predicted Values and Compute Logistic Predicted Values for each Subject;
Estimate Error Rates for the Discriminant Functions.

Each Phase has multiple steps. Within a phase some groups of steps are iterative; that is, a
specific set of steps may be repeated a number of times until a specified objective is achieved.
A representative embodiment of the Phases and their steps are described in the following
paragraphs.

Phase I. Establish Evaluation Methodology and Select Biomarkers for Consideration.

The following steps would appear in a representative embodiment of the subject invention.



Step 1: Select a methodology for estimating the procedure's error rates.

The methodology may incorporate any statistically appropriate method of estimating the error
rates. Two methods, of many that may be used, are training sample/validation sample, and
subsampling (or "resampling").



Training Sample/Validation Sample Method. In the training sample/validation sample
approach, the test population is randomly divided into two subsets, identified herein as a
"training sample" and a "validation sample." Every subject (member of the test population) is
assigned to either the training sample or the validation sample. The data from subjects in the
training sample are used in the statistical analyses leading to specification of the discriminant
procedure and probability estimation procedure. The data from subjects in the validation
sample will be used to estimate the discriminant procedure's error rates and the distribution of
the probability estimates.

Subsampling Methods. "Subsampling" refers to a class of statistical methods, including
jackknifing and bootstrapping, that can be used to produce reduced-bias estimates of error
rates. In a subsampling method, data from all subjects are used in the statistical analyses
leading to specification of the discriminant procedure and/or distribution of probability
estimates. Utilizing all the data can lead to a better discriminant procedure and/or probability
estimation procedure than would be obtained in the Training Sample/Validation Sample
approach, especially: (1) if the test population is not large, or, (2) if the a priori probability
of acquiring the biological condition is small, even with a large test population. In the
present context, subsampling methods are computationally intensive.
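To make the subsampling idea concrete, the sketch below estimates the two error rates by a leave-one-out (jackknife-style) loop. The "discriminant procedure" here is a deliberately simple stand-in, a midpoint cut on a single biomarker, and is not the procedure of the subject invention; all function names are hypothetical.

```python
from statistics import mean


def fit_threshold(values, labels):
    """Toy stand-in procedure: cut at the midpoint between the group means.

    labels use 1 for Group D and 0 for Group D-bar. Both groups must be
    present, otherwise statistics.mean raises StatisticsError.
    """
    mean_d = mean(v for v, y in zip(values, labels) if y == 1)
    mean_not_d = mean(v for v, y in zip(values, labels) if y == 0)
    return (mean_d + mean_not_d) / 2.0, mean_d >= mean_not_d


def predict(value, cut, d_is_high):
    """Return 1 (classified PD) or 0 (classified PD-bar)."""
    return int((value > cut) == d_is_high)


def leave_one_out_error_rates(values, labels):
    """Refit with each subject held out, then classify the held-out subject."""
    n = len(values)
    false_pos = false_neg = 0
    for i in range(n):
        cut, d_is_high = fit_threshold(values[:i] + values[i + 1:],
                                       labels[:i] + labels[i + 1:])
        guess = predict(values[i], cut, d_is_high)
        if guess == 1 and labels[i] == 0:
            false_pos += 1
        elif guess == 0 and labels[i] == 1:
            false_neg += 1
    return false_pos / n, false_neg / n


if __name__ == "__main__":
    print(leave_one_out_error_rates([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1]))
```

Every subject contributes to fitting in all but one refit, which is why such methods use the data more fully than a single training/validation split, at the cost of n refits.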
Step 2: Select the "training sample," i.e., the subset of the test population to be used for
statistical analyses leading to the discriminant procedure/probability estimation
procedure, and the "validation sample," which is the complementary subset.

If a subsampling method is to be used, data from all subjects are used in the statistical
analyses leading to specification of the discriminant procedure and/or distribution of
probability estimates. In this case, the "training sample" is the entire test population.

If the Training Sample/Validation Sample method is to be used, the training sample will
contain, approximately, a specified proportion of the test population. In many cases the
training sample proportion will be 50%; however, other proportions may also be used. The
validation sample will contain all subjects not included in the training sample.
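A random training/validation split of this kind, carried out separately within age-group strata, might look like the sketch below. The five-year stratum width, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(ages, train_fraction=0.75, width=5, seed=0):
    """Split subjects into (training_ids, validation_ids) within age strata.

    ages: dict of subject id (string) -> age in years.
    width: stratum width in years, e.g. 65 <= age < 70 forms one stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject_id, age in ages.items():
        strata[int(age // width)].append(subject_id)

    training = set()
    for members in strata.values():
        members.sort()                      # fixed order so the seed is reproducible
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        training.update(members[:n_train])  # approximately the specified proportion
    validation = set(ages) - training       # everyone not in the training sample
    return training, validation


if __name__ == "__main__":
    demo = {f"s{i}": 65 + (i % 5) for i in range(8)}
    print(stratified_split(demo))
```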



The random assignment of subjects to the training sample will typically be stratified on
subject age. Subject ages are classified into appropriate intervals; an age-group stratum
consists of all subjects whose age falls in the specific age interval. Intervals are selected so
that the number of subjects in each stratum is adequate for the statistical analyses. Within an
age-group stratum subjects will be randomly assigned to the training sample or validation
sample. The randomization is organized to achieve, approximately, the specified proportion
of subjects in the training sample. For example, if the training sample is specified to include
75% of the test population, approximately 75% of the subjects would be randomly assigned
to the training sample within each age-group stratum. For example, if "65 years ≤ age < 70
years" specifies one age-group stratum, approximately 75% of the subjects in this stratum
would be randomly assigned to the training sample.

The validation sample, if any, consists of all test population subjects that are not in the
training sample.

Step 3: Compile a list of Potential Biomarkers that are potential discriminators.

The goal of this step is to compile a list of all reasonable, potentially useful biomarkers, which
will be called Potential Biomarkers. In a representative embodiment, the list of Potential
Biomarkers will include all recorded, quantitative, personal characteristics of subjects in the
test population. The list will include characteristics that do not change over time (e.g., date of
birth) as well as time-dependent characteristics, such as body weight or a lab assessment from
blood or urine. Non-quantitative characteristics, e.g., the name of the subject's favorite color,
will be excluded.



Some of the Potential Biomarkers listed in Step 3 will not be useful for discrimination. The
remaining steps of this Phase compile a set of "Candidate Biomarkers" from the Step 3 list of
Potential Biomarkers. Each Candidate Biomarker will be selected because there is
information from previous research/knowledge, or quantitative evidence from the training
sample data, that the biomarker is a potentially useful discriminator. At each step, a
biomarker that is selected as a candidate is removed from the list of Potential Biomarkers and
moved to the set of Candidate Biomarkers. The reason for removing a selected Candidate
Biomarker from the list of Potential Biomarkers: once a biomarker has been selected as a
candidate there is no reason to reconsider it; it has already "made the list." At the end of the
process, all unselected Potential Biomarkers will be removed from further consideration; only
the Candidate Biomarkers will be subjected to additional analyses.

Step 4: Initiate the set of Candidate Biomarkers by including any Potential Biomarkers that,
on the basis of previous research and experience, are confidently believed to be
related to the specified biological condition.



The objective of this step is to utilize prior information on biomarkers that are potentially
important discriminants for the specified biological condition. For example, if the specified
biological condition is acquiring coronary heart disease (CHD) within a specified time,
previous research has shown that values of serum cholesterol, systolic blood pressure, glucose
intolerance, or cigarette smoking (to name just a few) are related to onset of CHD and should
be copied from the list of Potential Biomarkers to the list of Candidate Biomarkers.

Any reliable source of information or "educated guess" may be relied upon to select the subset
of biomarkers known or believed to be related to the specified biological condition. Although
the identity of the biomarkers initially selected is not critical to determining the identity of the
subset that is ultimately selected for use in discrimination, the initial selection of biomarkers
that are ultimately confirmed by this system as having the greatest statistical significance for
predicting the specified biological condition will assist in providing more rapid convergence
to the empirically determined subset. In other words, the more educated the initial selection,
the more rapid the convergence.



Step 5: Add to the list of Candidate Biomarkers any Potential Biomarkers that are
"statistically significantly" correlated with the "known important" biomarkers from
Step 4.

Data from the training sample are used to compute a correlation coefficient between each
previously identified Candidate Biomarker (which are "known important" biomarkers) and
each Potential Biomarker. Any statistically valid correlation coefficient may be used.

The goal is to identify biomarkers that may be good discriminators. A correlate of a "known
important" biomarker may be a better discriminator than the "known important" biomarker
itself. At the least, correlates of known important biomarkers should be included in the initial
analyses.



If the specified biological condition is actually defined by values of one or more biomarkers
(e.g., hypertension), the defining biomarkers would be "known important" biomarkers and
would have been moved to the list of Candidate Biomarkers in Step 4. Correlates of the
defining biomarkers would be moved to the list of Candidate Biomarkers in this Step.

"Statistical significance" is used here only as a tool for deciding between "probably
important" and "probably unimportant" correlates. In a representative embodiment, a
traditional p-value will be computed for a correlation between a Potential Biomarker and a
Candidate Biomarker. If p is less than some specified value, e.g., p<0.05, or p<0.01, the
Potential Biomarker is moved to the Candidate Biomarker list.
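The correlation screen of Step 5 can be sketched as follows. This example uses the Pearson coefficient with a large-sample Fisher z approximation for the p-value, chosen here only because it needs nothing beyond the standard library; the text permits any statistically valid correlation coefficient, and the function names are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length value lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_p_value(x, y):
    """Two-sided p-value via the Fisher z approximation (needs more than 3 pairs)."""
    r = max(min(pearson_r(x, y), 0.999999), -0.999999)  # keep atanh finite
    z = atanh(r) * sqrt(len(x) - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def promote_correlates(candidates, potentials, alpha=0.05):
    """Names of Potential Biomarkers correlated with any Candidate Biomarker.

    candidates, potentials: dict of biomarker name -> values, with subjects
    in the same order in every list.
    """
    return [name for name, values in potentials.items()
            if any(correlation_p_value(values, known) < alpha
                   for known in candidates.values())]


if __name__ == "__main__":
    chol = [180, 200, 220, 240, 260, 280, 300, 320]
    ldl = [100, 118, 141, 160, 178, 205, 219, 240]
    print(promote_correlates({"chol": chol}, {"ldl": ldl}))
```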
Step 6: Fit a logistic regression model for each Potential Biomarker, using a binary
indicator variable for the specified biological condition as the dependent (Y) variable
and age and the Potential Biomarker as the independent (X) variables. Add to the list
of Candidate Biomarkers each Potential Biomarker that is "statistically significant"
in its logistic regression model.

The objective of this step is to select as Candidate Biomarkers those Potential Biomarkers
that are related to the probability of acquiring the specified biological condition, after taking
the (linear) effect of age into account. The logistic model expresses the probability of
acquiring the specified biological condition as a function of the value of the Potential
Biomarker, in conjunction with a subject's age.

A biomarker is selected (or not) on the basis of a marginal p-value for the biomarker's slope
in the logistic regression model. As with the correlations above, "statistical significance" is
used here only as a tool for deciding between "probably important" and "probably
unimportant" discriminators. In a representative embodiment, a traditional p-value will be
computed for the slope of a Potential Biomarker. If p is less than some specified value, e.g.,
p<0.05, or p<0.01, the Potential Biomarker is moved to the Candidate Biomarker list.
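The Step 6 screen can be illustrated with a small, dependency-free logistic fit. This is a sketch under stated assumptions: Newton-Raphson iterations, a Wald test for the biomarker slope, age centered at its mean, and data that are not perfectly separable. The function names are hypothetical, and a production analysis would normally use an established statistics package.

```python
from math import exp, sqrt
from statistics import NormalDist


def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(matrix)
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        scale = a[col][col]
        a[col] = [v / scale for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def fit_logistic(rows, y, iterations=25):
    """Newton-Raphson fit; returns (coefficients, approximate covariance matrix)."""
    k = len(rows[0])
    beta = [0.0] * k
    cov = None
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = 1.0 / (1.0 + exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    info[i][j] += p * (1.0 - p) * x[i] * x[j]
        cov = invert(info)  # inverse information at the current estimate
        beta = [b + sum(cov[i][j] * grad[j] for j in range(k))
                for i, b in enumerate(beta)]
    return beta, cov


def biomarker_slope_p_value(ages, values, acquired):
    """Return (slope, two-sided Wald p-value) for the biomarker, adjusting for age."""
    mean_age = sum(ages) / len(ages)
    rows = [[1.0, a - mean_age, v] for a, v in zip(ages, values)]
    beta, cov = fit_logistic(rows, acquired)
    z = beta[2] / sqrt(cov[2][2])
    return beta[2], 2.0 * (1.0 - NormalDist().cdf(abs(z)))
```

A Potential Biomarker would be moved to the Candidate list when the returned p-value falls below the chosen cutoff, e.g. 0.05.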



Step 7: Evaluate each longitudinally-assessed Potential Biomarker, using a general linear
mixed model ("MixMod") to assess whether longitudinal trends in the biomarker's
values are related to acquisition of the specified biological condition. Each Potential
Biomarker with a statistically significant longitudinal trend is moved to the list of
Candidate Biomarkers.

The goal of this step is to identify biomarkers, other than those previously promoted to
Candidate Biomarker status, that have longitudinal trends that are related to the probability of
acquiring the specified biological condition.

In a typical embodiment of the subject invention, each model will be created as follows. The
dependent variable (Y) in the MixMod contains longitudinal values of the Potential
Biomarker. The independent (X) variables for fixed effects are: (1) a binary indicator variable
for the specified biological condition, (2) age or another relevant longitudinal metameter such
as time since some germane event, visit number, etc., and (3) the interaction between the
binary indicator variable for the specified biological condition and the longitudinal
metameter. The random effects part of the model includes a random subject increment to the
intercept of the population regression line and, in some cases, a random slope with respect to
the longitudinal metameter. When two or more random effects are included, the covariance
matrix of the random effects is typically unstructured. Age or another relevant longitudinal
metameter is included in the model for the same reasons as in Step 6.

If the coefficient corresponding to any X-variable other than age is statistically significant, the
Potential Biomarker is moved to the list of Candidate Biomarkers. The remarks on statistical
significance in Step 6 are applicable here.

At the end of Steps 4-7, all Potential Biomarkers have been examined and each biomarker
with historical or quantitative evidence of utility as a discriminator has been moved to the list
of Candidate Biomarkers.
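For concreteness, the Step 7 model for a single Potential Biomarker might be written as below. The symbols are introduced here only for illustration and are not taken from the source: $Y_{kv}$ is the biomarker value of subject $k$ at evaluation $v$, $C_k$ is the binary indicator for the specified biological condition, and $t_{kv}$ is age or another longitudinal metameter.

```latex
Y_{kv} = \beta_0 + \beta_1 C_k + \beta_2 t_{kv} + \beta_3 C_k t_{kv}
         + d_{0k} + d_{1k} t_{kv} + e_{kv},
\qquad
\begin{pmatrix} d_{0k} \\ d_{1k} \end{pmatrix} \sim N(0, \Delta),
\quad e_{kv} \sim N(0, \sigma^2)
```

Under this reading, $\beta_1$ captures a difference in level between the two groups and $\beta_3$ a difference in longitudinal trend; a significant estimate of either would move the biomarker to the Candidate list. The random slope $d_{1k}$ is included only "in some cases", and $\Delta$ is unstructured when both random effects are present.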



Phase II. Reduce the Candidate Biomarkers to a Set of Select Biomarkers that have
Discriminatory Power and Perform Mixed Model Estimation of the Covariance
Structure and Predicted Values.

Background. Prior art discriminant analysis methodology typically requires relatively precise
estimates of the mean vectors, $\mu_i$, and covariance matrices, $\Sigma_i$, of the distributions of the
biomarkers (and other variables, such as age and demographics) of the two groups, Group D
($i=1$) and Group D̄ ($i=2$). The $\mu_i$ are estimated as simple sample means (vectors) and the $\Sigma_i$
are estimated as simple sample covariance matrices, which do not permit adjustment of the
mean for important concomitant variables (or "covariates") and do not readily include
repeated measures from the same subject. Moreover, prior art discriminant analysis is
typically based upon a "casewise deletion" procedure: if a subject has any missing data, all of
that subject's data are deleted from the analyses.

Given estimates of the mean vectors, $\mu_1$, $\mu_2$, and covariance matrices, $\Sigma_i$, and the biomarker
(and related data) for a subject in a vector, $Y$, the traditional discriminant functions (linear if
$\Sigma_1 = \Sigma_2$, quadratic if $\Sigma_1 \neq \Sigma_2$) are evaluated solely from
$Y$, $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$. The only
information specific to the particular subject is in the vector $Y$.
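For reference, the textbook forms of these two discriminant functions under multivariate normality are sketched below. The prior probabilities $\pi_i$ are an assumption added for completeness; the source does not state how priors are handled.

```latex
% Quadratic discriminant score for group i (used when \Sigma_1 \neq \Sigma_2):
q_i(Y) = -\tfrac{1}{2}\ln\lvert\Sigma_i\rvert
         - \tfrac{1}{2}(Y-\mu_i)'\,\Sigma_i^{-1}(Y-\mu_i) + \ln\pi_i

% When \Sigma_1 = \Sigma_2 = \Sigma, the difference q_1(Y) - q_2(Y) is linear in Y:
q_1(Y) - q_2(Y) = (\mu_1-\mu_2)'\,\Sigma^{-1}\bigl(Y - \tfrac{1}{2}(\mu_1+\mu_2)\bigr)
                  + \ln\frac{\pi_1}{\pi_2}
```

A subject is assigned to the group with the larger score. Both expressions depend on the subject only through $Y$, which is the limitation the mixed model procedure is meant to address.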



The mixed model procedure, which is the greater part of Phase II, improves the traditional
procedure by using a general linear mixed model (MixMod) to model all of $\mu_1$, $\mu_2$, $\Sigma_1$, and
$\Sigma_2$; the modeled estimates of these parameters are used in the discriminant function rather
than the traditional simple, unmodeled estimates. This MixMod procedure makes the
following important improvements over traditional discriminant analysis:



The parameters are estimated using a Mixed Model, that:

♦ uses all available data, i.e., does not use casewise deletion;

♦ supports covariate adjustment of the estimated expected values ($\mu_i$), with
corresponding adjustment of the estimated covariance matrices $\Sigma_i$; and

♦ supports the utilization of repeated measures (e.g., from annual visits) from the
same subject.

This MixMod procedure utilizes model-based estimates of individual random effects
and "BLUPs" ("Best Linear Unbiased Predictors"), in addition to or in place of the
estimates of the population means $\mu_i$, which can substantially increase the
discrimination capability of the discriminant function.



Overview of the Phase II Procedure

As a result of Phase I, each Candidate Biomarker will have historical or quantitative evidence
of utility as a discriminator. However, there are substantial correlations among the Candidate
Biomarkers. Consequently, a biomarker that, considered by itself, has substantial
discriminatory power, may not make a substantial contribution when used in combination
with other biomarkers. In addition, the scales of the biomarkers may vary widely.

The objectives of Phase II of the subject procedure are to:

(1) Rescale the biomarker values so that standard deviations of all rescaled biomarkers are on
the same order of magnitude (0 < standard deviation ≤ 1).

(2) Reduce the possibly long list of Candidate Biomarkers to a smaller number of "Select
Biomarkers," each of which contributes substantially to the discriminatory power of
the set.

(3) Determine the structure of the expected value of the vector $Y$ of (rescaled) biomarker
values using a linear model of the form $E[Y] = X\beta$, and estimate $\beta$, a vector of
unknown parameters.

(4) Determine the structure of the covariance matrix of the vector $Y$ of (rescaled) biomarker
values using a model of the form $\Sigma = Z \Delta Z' + V$, and to estimate the unknown
covariance parameters in the matrices $\Delta$ and $V$.

(5) Estimate the random subject effect vector, $d_{ik}$, and compute the predicted-value
vector, $Y_{ik}^{(p)}$, of the $k$-th subject, as if that subject were in each specified
biological condition group; $i=1$ corresponds to Group D and $i=2$ corresponds to
Group D̄.



In a representative embodiment of the subject invention, Step 1 of this Phase is performed
once in order to rescale the biomarker data and arrange the data into one data vector (or one
variable in a dataset). Steps 2 and 3 are performed iteratively until the set of Select
Biomarkers has been selected and the estimates listed above have been computed. Step 4
refines the mixed model and parameter estimates to be used in the discrimination by selecting
appropriate models for the covariance matrices.



Step 1: Prepare a dataset in which one variable, "RespScal," contains scaled values
(including longitudinal measures) of all Candidate Biomarkers from all subjects.

The scaling is performed separately for each biomarker. Each biomarker value is divided by
the sample standard deviation of that biomarker. Thus, the standard deviation of the scaled
values of each biomarker is 1.00. (In a representative embodiment of the subject invention
the one variable of biomarker values may be named "RespScal", an abbreviation of
"Response - Scaled"). The sample standard deviation of RespScal is also approximately
1.00. This scaling facilitates convergence of the iterative procedure in subsequent mixed
model computations.

Step 1 is executed only once. Initially, all Candidate Biomarkers have data in RespScal and
are considered members of the set of Select Biomarkers. Non-discriminating biomarkers will
be removed from the Select Biomarkers in Steps 2-3.
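The rescaling of Step 1 might be sketched as follows, with the data held in "long" form (one row per subject, biomarker, and visit) so that all scaled values end up in a single column playing the role of RespScal. The row layout and function name are assumptions made for illustration.

```python
from statistics import stdev


def rescale(long_rows):
    """Divide each value by the sample standard deviation of its biomarker.

    long_rows: list of (subject_id, biomarker, visit, value) tuples.
    The standard deviation is pooled over all subjects and visits for a
    biomarker (an assumption; the text does not spell out the pooling).
    Returns rows of the same shape with the value replaced by its scaled value.
    """
    values_by_marker = {}
    for _, marker, _, value in long_rows:
        values_by_marker.setdefault(marker, []).append(value)
    sd = {marker: stdev(values) for marker, values in values_by_marker.items()}
    return [(subject, marker, visit, value / sd[marker])
            for subject, marker, visit, value in long_rows]


if __name__ == "__main__":
    rows = [("k1", "chol", 1, 180.0), ("k1", "chol", 2, 200.0), ("k2", "chol", 1, 220.0),
            ("k1", "sbp", 1, 110.0), ("k2", "sbp", 1, 130.0), ("k2", "sbp", 2, 120.0)]
    for row in rescale(rows):
        print(row)
```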



Step 2: Fit a general linear mixed model (MixMod) using the specifications listed below;
obtain estimates of the parameter matrices $\beta$, $\Delta$, and $V$, obtain estimates of each
subject's random subject effects, $d_{ik}$, and each subject's "predicted values," Y ik (min)
and Y ik (avg) as if the subject were in each specified biological condition group, $i=1, 2$.

In a representative embodiment of the subject invention the following are specifications of
the MixMod:

Dependent (Y) variable: RespScal;
Independent (X) variables and their coefficients ($\beta$):

    "Biological Condition Status," an indicator variable for the status of the
        specified biological condition (classification variable); Biological
        Condition Status = 1 if the corresponding element of Y contains
        information about a subject from Group D and Biological Condition
        Status = 0 otherwise.
    Biomarkers' indicator variables (classification variables);
    Biological Condition Status * Biomarkers' indicator variables (classification
        variables);
    Age (in years, centered at approximately the overall mean age of subjects;
        continuous variable);

Random effects variables ($Z_k$) and random coefficients (effects, $d_{ik}$):

    Subject * Biomarker indicator variables (part of $Z_k$) and corresponding
        random effects (intercept increments; part of $d_{ik}$).
    The random subject effect for a specific biomarker is constant across
        that subject's multiple visits, which generates correlations
        among repeated measurements of that biomarker for that
        subject.
    Note that the model assumes $E[d_k]=0$ and $V[d_k]=\Delta$.

Covariance matrix, $V_k = V(e_k)$, of the vector $e_k$ of biomarker random error terms, $e_{kbv}$,
for the $k$-th subject at the $v$-th longitudinal evaluation of the $b$-th scaled
Candidate Biomarker. This covariance matrix has one row and one column for
each longitudinal evaluation of each biomarker for the $k$-th subject. Note that
the model also assumes $E[e_k]=0$.

    The primary interpretation of $e_{kbv}$ is as a "random measurement error term,"
        representing variation, from one evaluation to another, of a value of the
        scaled Candidate Biomarker about subject $k$'s age-dependent mean
        value for that scaled Candidate Biomarker. With this interpretation, it
        is often reasonable to assume values of $e_{kbv}$ are homoscedastic and
        are uncorrelated, i.e., $\mathrm{Cov}(e_{kbv}, e_{k'b'v'}) = 0$ if $(k,b,v) \neq (k',b',v')$. If the
        elements of $Y$ are sorted by $k$ (subject ID), $b$ (biomarker ID), and $v$
        ("visit" or evaluation number or age of subject), then a reasonable
        model for $V_k$ in many cases is $V_k = \mathrm{BlockDiag}(V_{kb}) = \mathrm{BlockDiag}(V_{k1},
        V_{k2}, \ldots)$, where $V_{kb} = \lambda_b I$ and $\lambda_b = V(e_{kbv})$, the variance of measurement
        errors for scaled values of the $b$-th Candidate Biomarker, which
        variance is assumed to be the same for all subjects ($k$) and all
        evaluations ($v$).
    Note that the scaling of RespScal implies that each variance, $\lambda_b$, will be less
        than 1.00. The extent to which the variance is less than 1.00 depends
        upon the magnitudes of the fixed effects (a high $R^2$ leads to a smaller
        estimated variance) and the magnitudes of the variances of the random
        effects (diagonal elements of $\Delta$).
    Note the above combination of $Z_k$, $d_k$, $V_k = \mathrm{BlockDiag}(V_{kb})$ and $V_{kb} = \lambda_b I$
        generate a highly structured, extended compound symmetric model for
        $\Sigma_{ik}$. To illustrate the point in an example when the same covariance
        parameters apply to both Group D and Group D̄, let $d_k = [d_{kb}] = [d_{k1},
        d_{k2}, \ldots]'$ be the vector of random effects for the $k$-th subject and $b$-th
        scaled biomarker, let $V(d_k) = \Delta = (\delta_{bb'})$, where $\delta_{bb'} = \mathrm{Cov}(d_{kb}, d_{kb'})$,
        where $b$ and $b'$ index possibly different scaled biomarkers, let $Z_k$
        contain indicator variables for the scaled biomarkers, and let $V_{kb} = \lambda_b I$.
        Then $\Sigma_k = Z_k \Delta Z_k' + V_k = [\Sigma_{k,bb'}]$, where
        $\Sigma_{k,bb} = \delta_{bb} J + \lambda_b I$ = covariance matrix of multiple measurements
        from scaled biomarker $b$, and
        $\Sigma_{k,bb'} = \delta_{bb'} J$ = covariance of scaled biomarkers $b$ and $b'$ evaluated
        on the same occasion or on different occasions. (Each element of the
        square matrix $J$ equals 1.)
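The extended compound symmetric structure can be made concrete by assembling one subject's covariance matrix entry by entry. This is a sketch that assumes random intercepts only and rows sorted by biomarker and then visit; the function name and argument layout are hypothetical.

```python
def subject_covariance(delta, lam, visits):
    """Build Sigma_k = Z_k Delta Z_k' + V_k for one subject.

    delta:  B x B list of lists, covariances of the random intercepts
            (delta[b][b2] = Cov(d_kb, d_kb2)).
    lam:    list of B measurement-error variances, one per scaled biomarker.
    visits: list of B visit counts for this subject.
    Rows and columns are ordered by biomarker, then visit.
    """
    index = [(b, v) for b, count in enumerate(visits) for v in range(count)]
    sigma = []
    for b, v in index:
        row = []
        for b2, v2 in index:
            value = delta[b][b2]          # contribution of Z_k Delta Z_k'
            if (b, v) == (b2, v2):
                value += lam[b]           # diagonal of V_k = BlockDiag(lam_b * I)
            row.append(value)
        sigma.append(row)
    return sigma


if __name__ == "__main__":
    for row in subject_covariance([[0.5, 0.25], [0.25, 0.5]], [0.25, 0.125], [2, 1]):
        print(row)
```

With two visits of the first biomarker and one of the second, the upper-left 2 x 2 block is $\delta_{11} J + \lambda_1 I$ and every cross-biomarker entry equals $\delta_{12}$, matching the block description above.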



cova-iance 



The process of fining the mixed model produces estimates of 
The model's parameters, p, A, and parameters 

covariances for the two Biological Condition 

produces separate estimates of the 
The expected value of each subject's data vectc[r. 

Condition Status group /), 
The expected value of each subject's data vectcjr 

other response group (i *) % 
Each subject's random subject effect in the 

and also as if (he subject were in the 
Each subject's "predicted values," in the subj 

also as if the subject were in the other 
The subject's covariance matrix, E k . If the motfel 

the two Biological Condition Status 

estimates of the covariance matrices S 



other 



ject 



Step 3: Delete the biomarker that has the least « 
mixed model. 



A biomarker that will be an effective discriminant shoyld have a large (statistically 



= (A>h L w bere 6 hb . = Cov(rf, D , rf* h .) 
ifferent scaled biomarkers. lei Z k 



of \\. If the model assumes different 
Status groups, the model 
parameters in A, and V lk , 
M,k , (subject k being in Biological 



, as if the subject were in the 



subject s actual treatment group (/), 
response group (i ') % </ rk . 
's actual treatment group (/): I',/", and 
response group (i'J : Y t k (p) . 

assumes different covariances for 
gro|ups, the model produces separate 



apparent discriminant power and re-fit the 
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significant) Biological Condition Status * Biomarker 
Biomarker main effect is not relevant here: a large Bi 
differences among biomarker means - can arise simph 
types of variables and have different means (on the reseated 
Biological Condition Status * Biomarker effect indicates 
Biological Condition Status = 0 (Group D) is significahtlv 
the Biological Condition Status = 1 (Group D) mean 
difference should make an important contribution to 



the 



fixed effect, in contrast, a iarge 
marker main effect - indicating 
because the biomarkers are different 
axis). In contrast, a large 
that the biomarker mean for the 
y different from biomarker mean for 
the same biomarker Such a 
discrimination procedure. 



fOr 



If each current Selected Biomarker has a statistically si 
y Biomarker fixed effect, Step 3 is completed and we 
Seiect Biomarkers has a not-statistically-significant Bi 
fixed effect, the biomarker with the least statistically 
Condition Status * Biomarker fixed effect is removed 
to Step 2 where a MixMod is fined to the reduced data 



The strategy being implemented in Step 3 is an analos 
procedure in the stepwise regression context. An alternative 
"forward selection/' in which one initially includes onlv 
effective discriminants (biomarkers) in the data vector 
step, adds one more biomarker. 



Step 4: Determine the structures of the covariance parameter matrices, A„ and V th 



pected 



Discriminant analysis methodology uses both the ex 
covariance matrices of the biomarkers (some of which 
separately for each Biological Condition Status group, 
gnificant Biological Condition Status
move to Step 4. If one or more current
oiogical Condition Status * Biomarker
significant (largest p-value) Biological

torn the data vector, Y, and we return
vector.

of a "backwards elimination"

is to implement an analog of
a very small number of clearly
and model and, at each subsequent

values of the biomarkers and the
may be evaluated longitudinally)
D and D̄. Recall that the list of Select Biomarkers, including possible longitudinal assessments, already will have been finalized in Step 3. As noted above, a MixMod incorporates assumptions that lead to the following structure for the covariance matrices: $\Sigma_{ik} = Z_{ik} \Delta_i Z_{ik}' + V_{ik}$, where $i$ indexes Biological Condition Status group ($i = 1$ for Group D, $i = 2$ for Group D̄) and $k$ indexes subjects. In addition, the covariance parameter matrices $\Delta_i$ and $V_{ik}$ may have structure that can be exploited in the analysis, especially when $\Sigma_{ik}$ is very large, i.e., when there are many biomarkers and/or many longitudinal assessments of one or more biomarkers.

The objective of Step 4 is to determine the structure of the covariance parameter matrices $\Delta_i$ and $V_{ik}$ for use in the Phase III discriminant analyses. Estimates of large, structured covariance parameter matrices tend to be more precise than estimates of unstructured covariance parameter matrices. A more precise estimate of $\Delta_i$ and/or $V_{ik}$ leads to a more precise estimate of $\Sigma_{ik} = Z_{ik} \Delta_i Z_{ik}' + V_{ik}$, thence to more precise estimates of $\beta$, the $d_{ik}$, and the $\hat{Y}_{ik}$, and to more precise values of the discriminant function.

The overall structure of $\Sigma_{ik}$ must take into account the following types of covariances/correlations:

Type ADB: Covariances/correlations among different biomarkers evaluated at the same time point;

Type ALESB: Covariances/correlations among longitudinal evaluations of a single biomarker;

Type BTBEL: Covariances/correlations between two biomarkers, evaluated longitudinally, i.e., covariances/correlations between any pair of biomarkers, one evaluated at one time and the other evaluated at a different time.

In a representative embodiment of the subject invention, the structures described in Step 2, above, or extensions of these structures may be useful.
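As a minimal sketch of how the three covariance types combine in $\Sigma = Z \Delta Z' + V$, the following builds the matrix for one subject under compound symmetric $\Delta$ and $V$. The function names, the three-biomarker, two-visit layout, and the numeric values are illustrative assumptions, not part of the disclosed procedure.

```python
import numpy as np

def compound_symmetric(n, diag, offdiag):
    """Return an n x n matrix with `diag` on the diagonal and `offdiag` elsewhere."""
    m = np.full((n, n), float(offdiag))
    np.fill_diagonal(m, float(diag))
    return m

def subject_covariance(Z, Delta, V):
    """Sigma = Z Delta Z' + V for a single subject."""
    return Z @ Delta @ Z.T + V

# Hypothetical layout: 3 biomarkers at 2 visits, rows ordered visit by visit,
# with one random subject effect per biomarker.
n_biomarkers, n_visits = 3, 2
Z = np.kron(np.ones((n_visits, 1)), np.eye(n_biomarkers))   # 6 x 3
Delta = compound_symmetric(n_biomarkers, 0.67, 0.01)        # illustrative values
V = np.kron(np.eye(n_visits),                               # within-visit errors only
            compound_symmetric(n_biomarkers, 0.33, 0.015))
Sigma = subject_covariance(Z, Delta, V)

# Sigma[0, 1]: different biomarkers, same visit        (Type ADB)
# Sigma[0, 3]: same biomarker, different visits        (Type ALESB)
# Sigma[0, 4]: different biomarkers, different visits  (Type BTBEL)
```

Under these assumptions the cross-visit entries come entirely from $\Delta$, because $V$ is block-diagonal by visit.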



In a representative embodiment of the subject invention, the techniques described in Tangen, Catherine M., and Helms, Ronald W., (1996), "A case study of the analysis of multivariate longitudinal data using mixed (random effects) models," presented at the 1996 Spring Meeting of the International Biometric Society, Eastern North American Region, Richmond, Virginia, March, 1996, are used to explore covariance/correlation structures for longitudinal multivariate data. Selecting a covariance model typically requires fitting a number of MixMods, typically using the same expected-value model and varying the covariance model. Models may be compared via Log Likelihood statistics (assuming underlying normal distributions). Covariance structures may also be compared graphically using techniques developed by Ronald W. Helms at the University of North Carolina, e.g., Grady, J. J. and Helms, R. W. (1995), "Model Selection Techniques for the Covariance Matrix for Incomplete Longitudinal Data," Statistics in Medicine, 14, 1397-1416.

Phase III: Calculate Discriminant Functions Using Estimated Means and Predicted Values and Compute Logistic Predicted Values for each Subject; Estimate Error Rates for the Discriminant Functions

Background. The objective of Phase III is to "predict" which "population" or group a subject will belong to, Group D or Group D̄:

• Group D: That subpopulation of persons who will acquire the specified biological condition within a specified timeframe.

• Group D̄: That subpopulation of persons who will not acquire the specified biological condition within the specified timeframe.

A subject is classified by placing the subject into one of the following two groups:

Group PD: That group of persons who, at the beginning of the specified timeframe, are predicted to acquire the specified biological condition within the specified timeframe, i.e., projected to belong to Group D. These persons are described as having a prescribed high probability of acquiring the specified biological condition within the specified timeframe.

Group PD̄: That group of persons who, at the beginning of the specified timeframe, are predicted not to acquire the specified biological condition within the specified timeframe, i.e., projected to belong to Group D̄. These persons are described as having a prescribed low probability of acquiring the specified biological condition within a specified timeframe.

A second objective is to estimate the probabilities that a subject will belong to Groups D and D̄.

The technology for achieving the first objective - classifying a subject into one of the two groups - uses discriminant procedures that are modifications of traditional discriminant analysis. The estimates of the probability that the subject will be in the group of subjects that will acquire the specified biological condition are obtained from a modification of traditional logistic regression, (1) using the discriminant function values as regressors and (2) using the discriminant variables as regressors.

As noted in the background of Phase II, prior art discriminant analysis methodology typically utilizes naive estimates of the mean vectors, $\mu_i$, and covariance matrices, $\Sigma_i$, of the distributions of the biomarkers of the two groups. Moreover, prior art discriminant analysis is typically based upon a "casewise deletion" procedure: if a subject has any missing data, all of that subject's data are deleted from the analyses.

The mixed model procedure, described in Phase II, improves the traditional procedure by using a general linear mixed model (MixMod) to model all of $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$; the modeled estimates of these parameters are used in the discriminant function rather than the traditional simple, unmodeled estimates. The use of the mixed model permits the present procedures to make the following important improvements over traditional discriminant analysis: The parameters are estimated using all available data, i.e., the procedure does not use casewise deletion. The procedure supports covariate adjustment of the estimated expected values ($\mu_i$), with corresponding adjustment of the estimated covariance matrices $\Sigma_i$. And the procedure supports the utilization of repeated measures (e.g., from annual visits) from the same subject.



Perhaps more importantly, the use of the mixed model permits the present procedures to utilize model-based estimates of individual random effects and "BLUPs" ("Best Linear Unbiased Predictors"), in addition to or in place of the estimates of the population means $\mu_i$, which can substantially increase the discrimination capability of the discriminant function.



WO 98/35609 



PCT/US98/02433 



The form of the present discriminants is formally identical to the traditional discriminant based upon multivariate normality. Some notation is useful; let:

$f_i$ denote the density function of the distribution of the vector $Y$ of discriminant variables for a subject from group $i$, evaluated using "estimates" of $\mu_i$ and $\Sigma_i$, $i = 1$ for Group D or Group PD, $i = 2$ for Group D̄ or Group PD̄;

$p_i$ denote the a priori probability that a subject will come from group $i$, $i = 1$ for Group D, $i = 2$ for Group D̄. The values of the $p_i$ are often known from historical data or other research. If the values of the $p_i$ are unknown, the proportions of the subjects in the two groups may be used as estimates of the $p_i$.

Then a subject of unknown group with vector $Y$ of discriminant function values would be classified into group 1 (Group PD) if $\mathrm{Ln}[f_1(Y)/f_2(Y)] > \mathrm{Ln}[p_2/p_1]$ and would be assigned to group 2 (Group PD̄) otherwise.

In Phase II one will have decided whether one can reasonably assume the two groups have equal covariance matrices, $\Sigma_1 = \Sigma_2 = \Sigma$. In that case, the present discriminant procedure reduces to use of a linear discriminant function of the following form:

$$D(Y) = \left[Y - \tfrac{1}{2}(\mu_1 + \mu_2)\right]' \Sigma^{-1} (\mu_1 - \mu_2) - \mathrm{Ln}[p_2/p_1]$$

where the $\mu_i$ and $\Sigma$ are replaced by "appropriate" estimates to be discussed below. One compares $D(Y)$ vs. 0. If, in Phase II, it was decided that $\Sigma_1 \ne \Sigma_2$, the discriminant procedure reduces to use of a quadratic discriminant function of the following form:

$$Q(Y) = \tfrac{1}{2}\,\mathrm{Ln}\!\left(|\Sigma_2| / |\Sigma_1|\right) - \tfrac{1}{2}(Y - \mu_1)' \Sigma_1^{-1} (Y - \mu_1) + \tfrac{1}{2}(Y - \mu_2)' \Sigma_2^{-1} (Y - \mu_2) - \mathrm{Ln}[p_2/p_1]$$

where the $\mu_i$ and $\Sigma_i$ are replaced by "appropriate" estimates to be discussed below. One compares $Q(Y)$ vs. 0.

In either case, the "appropriate" estimates come from the mixed model procedure in Phase II.
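The linear and quadratic discriminant functions described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the function names are hypothetical, the prior-odds term is taken to be Ln[p2/p1], and the mean vectors and covariance matrices are assumed to be supplied by the Phase II mixed model.

```python
import numpy as np

def linear_discriminant(y, mu1, mu2, sigma, p1, p2):
    """D(Y) = [Y - (mu1 + mu2)/2]' Sigma^{-1} (mu1 - mu2) - Ln(p2/p1)."""
    w = np.linalg.solve(sigma, mu1 - mu2)          # Sigma^{-1} (mu1 - mu2)
    return float((y - 0.5 * (mu1 + mu2)) @ w - np.log(p2 / p1))

def quadratic_discriminant(y, mu1, mu2, sigma1, sigma2, p1, p2):
    """Q(Y) = Ln f1(Y) - Ln f2(Y) - Ln(p2/p1) for two multivariate normals."""
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    r1, r2 = y - mu1, y - mu2
    q1 = r1 @ np.linalg.solve(sigma1, r1)          # (Y - mu1)' Sigma1^{-1} (Y - mu1)
    q2 = r2 @ np.linalg.solve(sigma2, r2)          # (Y - mu2)' Sigma2^{-1} (Y - mu2)
    return float(0.5 * (logdet2 - logdet1) - 0.5 * q1 + 0.5 * q2 - np.log(p2 / p1))

# Illustrative values only: group 1 is Group D / PD, group 2 is the complement.
y = np.array([0.5, 0.2])
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
sigma = np.eye(2)
d_value = linear_discriminant(y, mu1, mu2, sigma, p1=0.5, p2=0.5)
q_value = quadratic_discriminant(y, mu1, mu2, sigma, sigma, p1=0.5, p2=0.5)
group = 1 if d_value >= 0 else 2
```

When the two covariance matrices are equal, the quadratic form collapses to the linear one, which is a convenient consistency check on any implementation.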
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Phase HI Procedure The steps of Phase III of the procedure are described below. It is 
assumed that data are available from one or more "new" subjects, i.e., subjects whose group membership is unknown and that were not used in the Phase II mixed model computations. In Steps 1-2 we shall consider one subject at a time. Some additional notation is useful. Let $i = 1$ for Group D or Group PD, $i = 2$ for Group D̄ or Group PD̄, and let:

$Y$ denote the vector of values of the discriminant variables for one new subject. The elements of $Y$ are scaled as RespScal was scaled in Phase II.

$X_i$ denote the matrix of values of the independent variables used in the final Phase II mixed model, as if the subject were in group $i$, $i = 1, 2$. Note that the rows of $X_i$ correspond to the rows (elements) of $Y$.

$Z_i$ denote the matrix of values of the random effect variables used in the final Phase II mixed model, as if the subject were in group $i$, $i = 1, 2$. Note that the rows of $Z_i$ correspond to the rows of $Y$.

$\hat{\Delta}_i$ denote the estimated covariance matrix of the random effects from group $i$, $i = 1, 2$, from the final Phase II mixed model. Note that in many cases the mixed model reduced to a single covariance for the random effects, i.e., $\hat{\Delta}_1 = \hat{\Delta}_2 = \hat{\Delta}$.

$\hat{V}_i$ denote the estimated covariance matrix of the random residuals or "error terms" from group $i$, $i = 1, 2$, from the final Phase II mixed model. Note that in many cases the mixed model reduced to a single covariance matrix, i.e., $\hat{V}_1 = \hat{V}_2 = \hat{V}$.

$\hat{\Sigma}_i = Z_i \hat{\Delta}_i Z_i' + \hat{V}_i$ denote the estimated covariance matrix of $Y$, from the final Phase II mixed model, as if the new subject came from group $i$, $i = 1, 2$. Note that in many cases the mixed model reduced to a single covariance matrix, i.e., $\hat{\Sigma}_1 = \hat{\Sigma}_2 = \hat{\Sigma}$.

Step 1: Using results from the Phase II mixed model, classify all subjects in the validation sample and estimate the error rates of multiple candidate discriminant procedures, one based on "estimated values," and others based upon "predicted values" utilizing various combinations of the estimated random subject effects. The procedure with the lowest estimated error rate is the selected procedure and is referred to as the "apparently most reliable procedure."

If the original study population was divided into a "training sample" and a "validation sample," use the validation sample in the following; otherwise use the training sample as the "validation sample." Estimate the following quantities for each subject in the validation sample, separately, as if the subject came from each group:

$\hat{Y}_i = X_i \hat{\beta}$, the "estimated value" of $Y$, as if the subject came from group $i$, $i = 1, 2$.

$\hat{d}_i = \hat{\Delta}_i Z_i' \hat{\Sigma}_i^{-1} (Y - X_i \hat{\beta})$, the estimate of the subject's random subject effect, as if the subject came from group $i$, $i = 1, 2$.

$d_{min} = \hat{d}_1$ if $\|\hat{d}_1\| \le \|\hat{d}_2\|$; otherwise $d_{min} = \hat{d}_2$. $d_{min}$ may be thought of as the "minimum" of $\hat{d}_1$ and $\hat{d}_2$, or the "minimum (over groups) random subject effect" estimate.

$d_{avg} = (\hat{d}_1 + \hat{d}_2)/2$. $d_{avg}$ may be thought of as the "average" of $\hat{d}_1$ and $\hat{d}_2$, or the "average (over groups) random subject effect" estimate.

$Y_i^{(min)} = X_i \hat{\beta} + Z_i d_{min}$, the subject's "predicted values," as if the subject came from group $i$, $i = 1, 2$, but using the "minimum" random subject effect estimate.

$Y_i^{(avg)} = X_i \hat{\beta} + Z_i d_{avg}$, the subject's "predicted values," as if the subject came from group $i$, $i = 1, 2$, but using the "average" random subject effect estimate.

In the above and below, $i = 1$ for Group D or Group PD, $i = 2$ for Group D̄ or Group PD̄.

Classification based upon the estimated values, $\hat{Y}_i$:

If the decision $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Phase II, evaluate the linear discriminant function, $D(Y)$ (above), substituting $\hat{Y}_i$ for $\mu_i$ and $\hat{\Sigma}$ for $\Sigma$. Assign the subject to group 1 (Group PD) if $D(Y) \ge 0$; otherwise assign the subject to group 2 (Group PD̄).

If the decision $\Sigma_1 \ne \Sigma_2$ was made in Phase II, evaluate the quadratic discriminant function, $Q(Y)$ (above), substituting $\hat{Y}_i$ for $\mu_i$ and $\hat{\Sigma}_i$ for $\Sigma_i$, $i = 1, 2$. Assign the subject to group 1 (Group PD) if $Q(Y) \ge 0$; otherwise assign the subject to group 2 (Group PD̄).

Classification based upon the "minimum" random subject effects and predicted values, $Y_i^{(min)}$:

If the decision $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Phase II, evaluate the linear discriminant function, $D(Y)$ (above), substituting $Y_i^{(min)}$ for $\mu_i$ and $\hat{\Sigma}$ for $\Sigma$. Assign the subject to group 1 (Group PD) if $D(Y) \ge 0$; otherwise assign the subject to group 2 (Group PD̄).

If the decision $\Sigma_1 \ne \Sigma_2$ was made in Phase II, evaluate the quadratic discriminant function, $Q(Y)$ (above), substituting $Y_i^{(min)}$ for $\mu_i$ and $\hat{\Sigma}_i$ for $\Sigma_i$, $i = 1, 2$. Assign the subject to group 1 (Group PD) if $Q(Y) \ge 0$; otherwise assign the subject to group 2 (Group PD̄).

Classification based upon the "average" random subject effects and predicted values, $Y_i^{(avg)}$:

If the decision $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Phase II, evaluate the linear discriminant function, $D(Y)$ (above), substituting $Y_i^{(avg)}$ for $\mu_i$ and $\hat{\Sigma}$ for $\Sigma$. Assign the subject to group 1 (Group PD) if $D(Y) \ge 0$; otherwise assign the subject to group 2 (Group PD̄).

If the decision $\Sigma_1 \ne \Sigma_2$ was made in Phase II, evaluate the quadratic discriminant function, $Q(Y)$ (above), substituting $Y_i^{(avg)}$ for $\mu_i$ and $\hat{\Sigma}_i$ for $\Sigma_i$, $i = 1, 2$. Assign the subject to group 1 (Group PD) if $Q(Y) \ge 0$; otherwise assign the subject to group 2 (Group PD̄).

After each subject in the validation sample (as defined above) is classified, compute a 2 × 2 table, similar to the following, for each of the three procedures (based on estimated values or based upon predicted values):

Numbers of subjects in the validation sample tabulated by actual and classified membership in D.

                                  Subject was classified as a member of Group:
                                  PD̄                                PD
Subject was actually    D̄        N_11 = Number of true            N_12 = Number of false
a member of Group:                negative classifications         positive classifications
                        D         N_21 = Number of false           N_22 = Number of true
                                  negative classifications         positive classifications

Further, compute separately for classification based on estimated values and for classification based upon predicted values:

$r_{FP} = N_{12} / N_{1\cdot}$ = false positive error rate = proportion of false positive classifications

$r_{FN} = N_{21} / N_{2\cdot}$ = false negative error rate = proportion of false negative classifications

$r_{tot} = (N_{12} + N_{21}) / (N_{1\cdot} + N_{2\cdot})$ = total error rate = proportion of false classifications

where $N_{1\cdot} = N_{11} + N_{12}$ and $N_{2\cdot} = N_{21} + N_{22}$ are the row totals of the table.
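The 2 × 2 tabulation and the three error rates can be sketched as below. The function name and the boolean encoding (True for Group D or classified PD) are assumptions made for illustration; the denominators are taken to be the row totals of the table, i.e., the numbers of subjects actually outside and inside Group D.

```python
def error_rates(actual_d, classified_pd):
    """Tabulate actual vs. classified membership and return the error rates.

    actual_d:      iterable of bool, True if the subject actually is in Group D.
    classified_pd: iterable of bool, True if the subject was classified as PD.
    """
    n11 = n12 = n21 = n22 = 0
    for actual, classified in zip(actual_d, classified_pd):
        if not actual and not classified:
            n11 += 1          # true negative
        elif not actual and classified:
            n12 += 1          # false positive
        elif actual and not classified:
            n21 += 1          # false negative
        else:
            n22 += 1          # true positive
    if n11 + n12 == 0 or n21 + n22 == 0:
        raise ValueError("both actual groups must be represented")
    return {
        "table": ((n11, n12), (n21, n22)),
        "r_fp": n12 / (n11 + n12),
        "r_fn": n21 / (n21 + n22),
        "r_tot": (n12 + n21) / (n11 + n12 + n21 + n22),
    }

# Illustrative data: five validation subjects.
rates = error_rates(
    actual_d=[False, False, False, True, True],
    classified_pd=[False, True, False, True, False],
)
```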

In a typical embodiment of the subject invention, one will compare the three types of classification procedures, i.e., the one based on estimated values, $\hat{Y}_i$, the one based on "minimum" predicted values, $Y_i^{(min)}$, and the one based on "average" predicted values, $Y_i^{(avg)}$, to determine the "apparently most reliable procedure." Some considerations in the selection process are:

If a false negative classification has substantially more serious consequences than a false positive classification, select the procedure with the smaller false negative error rate, $r_{FN}$. This situation could arise, for example, if Group D is the subpopulation of persons who will suffer a myocardial infarction ("MI") within a specified five year age group. A false negative classification, failure to warn a person of a high MI probability, could have more serious consequences than a false positive classification, warning a low-probability person that they have a high MI probability.

Conversely, if a false positive classification has substantially more serious consequences than a false negative classification, select the procedure with the smaller false positive error rate, $r_{FP}$.

When there is no a priori reason to assign greater seriousness to either a false negative or a false positive classification, select the procedure with the smaller total error rate, $r_{tot}$.

The procedure selected as the apparently most reliable procedure is used to classify subjects into the two groups, Group PD and Group PD̄.
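The Step 1 quantities (the estimated value, the random subject effect estimates, and the "minimum" and "average" predicted values) can be sketched for a single subject as follows. This is an illustration under assumptions that are not stated in the text: one common $\Delta$ and $V$ for both groups, and the Euclidean norm as the measure by which the "minimum" random effect is chosen. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def step1_values(y, X, Z, beta, Delta, V):
    """Estimated and predicted values of Y as if the subject were in group 1 or 2.

    X and Z map the group label (1 or 2) to that group's design matrices.
    """
    est, blup = {}, {}
    for i in (1, 2):
        sigma_i = Z[i] @ Delta @ Z[i].T + V                 # Sigma_i = Z_i Delta Z_i' + V
        est[i] = X[i] @ beta                                # estimated value X_i beta
        # Random subject effect estimate: Delta Z_i' Sigma_i^{-1} (Y - X_i beta)
        blup[i] = Delta @ Z[i].T @ np.linalg.solve(sigma_i, y - est[i])
    # Assumption: "minimum" means the smaller Euclidean norm.
    d_min = blup[1] if np.linalg.norm(blup[1]) <= np.linalg.norm(blup[2]) else blup[2]
    d_avg = 0.5 * (blup[1] + blup[2])
    return {
        i: {
            "estimated": est[i],
            "minimum": est[i] + Z[i] @ d_min,
            "average": est[i] + Z[i] @ d_avg,
        }
        for i in (1, 2)
    }

# Toy subject: one scaled biomarker value, intercept plus a group indicator.
y = np.array([0.5])
beta = np.array([0.0, 2.0])
X = {1: np.array([[1.0, 1.0]]), 2: np.array([[1.0, 0.0]])}
Z = {1: np.array([[1.0]]), 2: np.array([[1.0]])}
Delta = np.array([[1.0]])
V = np.array([[1.0]])
values = step1_values(y, X, Z, beta, Delta, V)
```

Each of the three entries per group can then be substituted for the group mean in the discriminant function to give the three candidate procedures.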



Step 2: Use two types of logistic regression to compute estimates of the probability that a new subject will belong to each group.

The data from the training sample are used to fit a logistic regression model in which the value of the discriminant function ($D(Y)$ if linear, $Q(Y)$ if quadratic) for each subject is used as the independent ("X") variable and the Biological Condition Status (indicator variable for membership in Group D) as the dependent ("Y") variable. The model is used, together with the inverse logistic transform, to compute for each subject an estimate of the probability that the subject will belong to Group D.

In a separate calculation, the data from the training sample are used to fit a logistic regression model in which the biomarkers used in the discriminant function, together with the final mixed model covariates (variables in X), are incorporated as independent ("X") variables and the Biological Condition Status (indicator variable for membership in Group D) as the dependent ("Y") variable. In addition to obtaining the usual logistic regression model estimates, the model is used, together with the inverse logistic transform, to compute for each subject an estimated probability that the subject will belong to Group D. When longitudinal data are used, the model is used to estimate the probability that the subject will belong to Group D at the end of the specified period. One can use a generalized estimating equation approach with a logistic link function to accommodate correlations among the multiple binomial outcomes from one subject.
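The first of the two logistic regressions (Biological Condition Status regressed on the discriminant function value) can be sketched with a plain Newton-Raphson fit. This is a minimal illustration, not the disclosed software: the function names and the toy data are hypothetical, the fit assumes the two groups overlap on the discriminant scale so that finite maximum likelihood estimates exist, and it does not cover the generalized estimating equation variant for repeated outcomes.

```python
import numpy as np

def fit_logistic(x, status, n_iter=25):
    """Fit P(status = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    design = np.column_stack([np.ones(len(x)), x])
    coef = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ coef))
        gradient = design.T @ (status - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        coef = coef + np.linalg.solve(hessian, gradient)
    return coef

def predict_probability(coef, x):
    """Inverse logistic transform of the fitted linear predictor."""
    return 1.0 / (1.0 + np.exp(-(coef[0] + coef[1] * np.asarray(x, dtype=float))))

# Hypothetical training data: discriminant values and Group D indicators.
d_values = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 0.3])
in_group_d = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0])
coef = fit_logistic(d_values, in_group_d)
probabilities = predict_probability(coef, d_values)
```

The second regression described above has the same form, with the individual biomarkers and mixed model covariates as columns of the design matrix in place of the single discriminant value.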



The predicted probabilities from these two models can provide interesting interpretations of discriminant function values.

While the subject algorithm is the preferred embodiment for determining the discriminant function to be used in the subject invention, it is to be understood that this algorithm is provided solely for the purpose of illustrating the preferred embodiment of the subject invention, and in no case is it to be understood that the subject invention is limited to the steps or substeps of the algorithm described herein. For example, it is to be understood that in the art and field of discriminant analysis methodology, there are other types of discriminant functions, e.g., so-called "optimal discrimination," other types of regression, e.g., nonlinear mixed models, etc., that may also be used while falling fully within the scope and spirit of the subject invention.

This invention will now be described in detail with respect to specific representative embodiments thereof, the materials, apparatus and process steps being understood as examples that are intended to be illustrative only. In particular, the invention is not intended to be limited to the statistical methods, materials, conditions, process parameters, apparatus and the like specifically recited herein.

AN EXAMPLE OF THE PREFERRED EMBODIMENT

The attached tables and Figure present the results of an illustrative analysis of data using the methods and procedures of the subject invention.

The data used as a basis for this example were obtained from a database including patients for whom Sickle Cell data are acquired on an annual basis. Some patients have data from three consecutive visits. However, since patients typically cannot be compelled to participate annually, the database includes many patients for whom data are available from only one or two annual visits. Database information that was used here included demographic data, clinical chemistry data, and hematological data.



The specified biological condition of interest (the "disease" or "affliction") in this example was an occurrence of a painful crisis that required hospitalization. At each annual visit the subject is asked (and records are checked to determine) if the subject had a painful crisis that required hospitalization in the preceding year. Each subject who reported having had a hospitalization for a painful crisis at any visit (any year) is a member of the "Diseased" group (Group D); all other subjects are members of Group D̄.

Whenever a subject had had a painful crisis that required hospitalization in the preceding year, all data that were collected after the hospitalization for the painful crisis, in the same year or in later years, were excluded from the analysis. This mimics the procedure that would be used if the outcome were death or occurrence of a chronic, incurable disease. The variable that records a subject's Group D membership (e.g., diseased or not, afflicted or not) is named the "Disease Status" variable.



The following is an example of the statistical analysis procedures using the sickle cell data. For reasons of confidentiality, the data used in this example are artificial and do not come from a real study or from real subjects. However, the data are similar to data that could have been obtained in a study of real subjects.

Phase I. Establish Evaluation Methodology and Select Biomarkers for Consideration.

Step 1: Select a methodology for estimating the procedure error rates.

Step 2: Select the "training sample," i.e., the subset of the test population to be used for statistical analyses leading to the discriminant procedure/probability estimation procedure, and the "validation sample," which is the complementary subset.



The Training Sample/Validation Sample Method was chosen for this example. Patients were randomly assigned to one of the two samples. The training sample was used to create the discriminant function; the validation sample was used to evaluate the accuracy of the discriminant function.

The training sample included information from 641 annual evaluations from 481 subjects, or about 1.3 annual evaluations per subject. However, not all biomarkers were assessed, even when a subject made a visit. For an extreme example, only 88 values of Direct Bilirubin (variable L_DBILI) were available from only 80 subjects.



Step 3: Compile a list of Potential Biomarkers that are potential discriminators.

In this case, blood pressures, all available demographic data, clinical chemistry data, and hematological data were used as potential discriminators. The Potential Biomarkers are listed in Table 2.

Step 4: Initiate the set of Candidate Biomarkers by including any Potential Biomarkers that, on the basis of previous research and experience, are confidently believed to be related to the specified biological condition.

In the example, Platelet Count (or "Platelets") was taken as a "known" biomarker for Disease Status, hospitalization for a pain crisis.

Step 5: Add to the list of Candidate Biomarkers any Potential Biomarkers that are "statistically significantly" correlated with the "known important" biomarkers from Step 4.

Biomarkers were selected that were correlated with the "known important" biomarker, platelets, from Step 4. A summary of these correlations is shown in Table 3, in the columns labeled "Correlation W/ Platelets". The "p" column shows the p-values for correlations with Platelets. A biomarker was selected on the basis of a marginal p-value for the Pearson product-moment correlation coefficient. In the example, p < 0.01 was required for selection. The "->CV" column indicates, by the presence of the word "YES," those biomarkers that became Candidate Biomarkers as a result of a "significant" correlation with Platelets.



Step 6: Fit a logistic regression model for each Potential Biomarker, using a binary indicator variable for the specified biological condition as the dependent (Y) variable and age and the Potential Biomarker as the independent (X) variables. Add to the list of Candidate Biomarkers each Potential Biomarker that is "statistically significant" in its logistic regression model.

A logistic regression model was fitted for each biomarker, using Disease Status as the dependent (Y) variable and a combination of age and the biomarker as the independent (X) variables. In this case, for each biomarker the logistic regression assesses how the probability of a hospitalization for a painful crisis is described by that biomarker, in conjunction with the subject's age. Roughly speaking, the biomarker's regression coefficient, or slope, in the logistic regression will be approximately zero if there is no relationship between the biomarker and the probability that the subject will acquire the specified biological condition; a nonzero slope indicates a relationship. A summary of the logistic regression results is shown in Table 3, in the columns headed "Logistic Regression." The "p" column shows the p-values for the biomarker's regression coefficient. A biomarker was selected on the basis of a marginal p-value for the biomarker's slope in the logistic regression model. In the example, p < 0.01 was required for selection. The "->CV" column indicates, by the presence of the word "YES," those biomarkers that became Candidate Biomarkers as a result of a "significant" logistic regression coefficient. Note that some of these biomarkers were also significantly correlated with Platelets and were Candidate Biomarkers before the logistic regressions were computed.

Step 7: Evaluate each longitudinally-assessed Potential Biomarker, using a general linear mixed model ("MixMod") to assess whether longitudinal trends in the biomarker's values are related to acquisition of the specified biological condition. Each Potential Biomarker with a statistically significant longitudinal trend is moved to the list of Candidate Biomarkers.

A mixed model was fitted for each biomarker, using longitudinal values of the biomarker as the dependent (Y) variable, with Age, Disease Status, and Visit Number × Disease Status as the independent (X) variables, and a subject effect in the random effects (Z) part of the model. (Visit Number and Disease Status are "classification" variables; the corresponding coefficients are increments to an intercept. In contrast, Age is a continuous variable whose coefficient is a slope.) The random effects part of the mixed model incorporates the correlations between longitudinal measurements from the same subject. The model permits the number of visits (longitudinal assessments) to vary from subject to subject.



A biomarker could be selected if either the Disease Status "main effect" or the subvector of three Visit Number × Disease Status interaction coefficients was statistically significantly different from zero (p < 0.01). A significant Disease Status "main effect" would indicate that the mean of the biomarker values for subjects in Group D is different from the mean for subjects in Group D̄. A significant subvector of three Visit Number × Disease Status interaction coefficients would indicate that the time trend in biomarker values for subjects in Group D is different from the time trend for subjects in Group D̄. In either case (significant main effect or interaction), the results would indicate that the biomarker is a potentially useful discriminator and should be moved to the Candidate Biomarker list. The results from the mixed models are shown in Table 3 in the columns headed "Mixed Model." Separate results are shown for main effects and interactions, in a format similar to results from correlations and logistic regressions.

At the end of Steps 4-7, all Potential Biomarkers have been examined and each biomarker with historical or quantitative evidence of utility as a discriminator has been moved to the list of Candidate Biomarkers. The Candidate Biomarkers are indicated by the word "YES" in Table 3 in the column headed "Selected."


Phase II. Reduce the Candidate Biomarkers to a Set of Select Biomarkers that have Discriminatory Power and Perform Mixed Model Estimation of the Covariance Structure and Predicted Values.

Step 1: Prepare a dataset in which one variable, "RespScal," contains scaled values (including longitudinal measures) of all Candidate Biomarkers from all subjects.

This step was executed for the example but the results are not shown. However, note that when all the values of all the different biomarkers are placed into one column vector, Y, the vector can contain a large number of elements.

Step 2: Fit a general linear mixed model (MixMod) using the specifications listed below; obtain estimates of the parameter matrices $\beta$, $\Delta$, and $V$, obtain estimates of each subject's random subject effects, $d_{ik}$, and each subject's "predicted values," $Y_{ik}^{(min)}$ and $Y_{ik}^{(avg)}$, as if the subject were in each specified biological condition group, $i = 1, 2$.

Step 3: Delete the biomarker that has the least apparent discriminant power and re-fit the mixed model.

Steps 2-3 are repeated iteratively until all biomarkers in the model are statistically significant. In the interests of conserving space in this presentation of an example, only the final results of the iterations through Steps 2-3 are discussed. Steps 2-3 reduced the number of biomarkers to 15, with Age as a fixed effect covariate.
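The iteration through Steps 2-3 can be sketched as a backward elimination loop. The sketch below is illustrative only: `interaction_p_values` is a hypothetical callback standing in for a mixed model re-fit that returns one p-value per biomarker, the 0.05 threshold is an assumption, and the p-values in the usage example are invented for demonstration.

```python
def backward_eliminate(biomarkers, interaction_p_values, alpha=0.05):
    """Drop the least significant biomarker and re-fit until all are significant.

    interaction_p_values(selected) must return a dict mapping each biomarker
    name in `selected` to the p-value of its group interaction term.
    """
    selected = list(biomarkers)
    while selected:
        p_values = interaction_p_values(selected)        # Step 2: (re-)fit the model
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break                                        # every biomarker is significant
        selected.remove(worst)                           # Step 3: delete the weakest
    return selected

# Stand-in for a real model fit; the p-values are made up for this example.
FAKE_P = {"ALBUMIN": 0.001, "L_BUN": 0.03, "MCH": 0.20, "MCHC": 0.60}

def fake_fit(selected):
    return {name: FAKE_P[name] for name in selected}

kept = backward_eliminate(FAKE_P, fake_fit)
```

In a real analysis the p-values would change after every re-fit, which is why the callback is invoked inside the loop rather than once up front.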



General information for the example mixed model is given in Table 4. Data were available from 481 patients with a maximum of three visits for each patient. Note the apparently large numbers of observations not used in the analysis. Artificial observations were generated with missing Y values to compel the software to compute the required predicted values. The artificial observations with missing Y values have no impact on the estimation of parameters or prediction of random subject effects.



Table 5 gives the estimates of the fixed effects from the mixed model. The p-value for each biomarker (e.g., the p-value for "L_BUN") is a p-value for a test of the hypothesis that the mean value of this biomarker is the same as the overall mean, averaged over all biomarkers. The fact that these p-values are significant is of little interest; one expects the mean of one biomarker's values to be different from the mean of another biomarker's values.

In Table 5 the p-value for each "biomarker X GROUP IA" interaction (e.g., the p-value for "ALBUMIN X GROUP IA") is a p-value for a test of the hypothesis that the mean value of the biomarker for Group D is significantly different from the mean value of the biomarker for Group D̄. A significant value (e.g., p < 0.05) indicates that the biomarker should be a good discriminator. All of the interactions in the final model represented by Table 5 are statistically significant (all p ≤ 0.05). Age was forced to remain in the model even though the p-value is not significant.



Subject-, biomarker-, Disease Status ("Group")-, and 
values for subject 447 are shown in Table 6. This subj 
D?"=NO: note "RESPSCAL" is missing for rows with 
Predicted values for both groups. Note also that this 
or MCHC for Visit 2, but we have model-based 
MCH and MCHC. 



The strategy implemented in Steps 2-3 is an analog of a "backwards elimination" procedure 



in the stepwise regression context. An alternative w 



VlSll- 



specific observed and predicted 
fct was in Group D ("GROUP 

GROUP D?"=YES), but we have 
ubject had no data for biomarker MCH 
predicted values for that subject's Visit 2 



- "-5'—"" nn alternative would be to implement an analog of 

"forward selection," in which one initially includes onljy two (or very small numbers of) 
clearly effective discriminants (biomarkers) in the model and. at each sub^nuem «,,„ „ 



^•f*n aHrk 
Step 4: Determine the structures of the covariance parameter matrices, Δ_k and V_ik

As noted above, the overall structure of Σ_ik must take into account three types of covariances/
correlations:

Type ADB: Covariances/correlations among different biomarkers evaluated at the
same time point;

Type ALESB: Covariances/correlations among longitudinal evaluations of a single
biomarker;

Type BTBEL: Covariances/correlations between two biomarkers, evaluated
longitudinally, i.e., covariances/correlations between any pair of biomarkers,
one evaluated at one time and the other evaluated at a different time.

In the example the following structures were ultimately obtained:

Identical random effects covariance parameter matrices for both Group D and Group
D̄, i.e., Δ_1 = Δ_2 = Δ, and
Δ has compound symmetric structure, δ_ii = 0.6669, δ_ij = 0.0097 for i≠j.

Type ADB covariances in matrix V, which is the same for both Group D and Group
D̄, and compound symmetric structure, v_ii = 0.3267, v_ij = 0.0151 for i≠j.

This covariance structure was reasonable given the sickle cell data at hand.

Estimates of Δ and V are shown in Table 7. The estimate of Δ, the covariance matrix of the
random subject effects, is in the top of the table. The rows and columns correspond to the 15
biomarkers used in this model; the columns are labeled ...

The estimate of V, the covariance matrix of the within-subject, within-visit errors, is in the
bottom of the table. As with Δ, the rows and columns correspond to the 15 biomarkers used
in this model. V has compound symmetric structure, which is reasonable for the scaled data.

Phase III: Calculate Discriminant Functions Using Estimated Means and Predicted
Values and Compute Logistic Predicted Values for each Subject; Estimate Error
Rates for the Discriminant Functions
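The discriminant functions of this phase rely on the covariance estimates from Step 4, both of which are reported as compound symmetric (one value on the diagonal, one value everywhere else). The following is a minimal sketch, in plain Python, of building such matrices from the two reported parameters and inverting them in closed form. The function names are illustrative and do not come from the source; the dimension 15 is the number of biomarkers stated for the example model.

```python
def compound_symmetric(p, diag, offdiag):
    """Return a p x p matrix with `diag` on the diagonal and `offdiag` elsewhere."""
    return [[diag if i == j else offdiag for j in range(p)] for i in range(p)]


def compound_symmetric_inverse(p, diag, offdiag):
    """Closed-form inverse of a compound symmetric matrix.

    Writing the matrix as (a - b) * I + b * J, with J the all-ones matrix,
    the inverse is (1 / (a - b)) * (I - (b / (a + (p - 1) * b)) * J).
    """
    a, b = diag, offdiag
    scale = 1.0 / (a - b)
    shrink = b / (a + (p - 1) * b)
    return [[scale * ((1.0 if i == j else 0.0) - shrink) for j in range(p)]
            for i in range(p)]


def matmul(x, y):
    """Plain-Python product of two square matrices of the same size."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


P = 15  # number of biomarkers in the example model

# Parameter values as reported for the sickle cell example.
delta = compound_symmetric(P, 0.6669, 0.0097)  # random subject effects
v = compound_symmetric(P, 0.3267, 0.0151)      # within-subject, within-visit errors

delta_inv = compound_symmetric_inverse(P, 0.6669, 0.0097)
identity_check = matmul(delta, delta_inv)      # should be close to the identity
```

Because only two parameters describe each matrix, the inverse needed by a linear discriminant never requires a general-purpose matrix solver; that is a convenience of this structure, not a claim about how the original computation was carried out.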



* e "° r ***** U*t« , 
ow based on ~es, im a,e^ues. - W 4 L rf v^^..,^. . 

»*- e„ Wom i, iycc , c#cc „ rfepro „^ ^ ^ 
/new/ reliable procedure. " 



The present procedures were applied using the mixed model results for the sickle cell data.
Since the covariance parameter matrices were modeled to be equal for Group D and Group D̄,
each discriminant was a linear discriminant. Each discriminant was applied to the subjects in
the training sample (used here as a validation sample), projecting each subject to belong to
either Group PD or Group PD̄.

An evaluation of the subject linear discriminant function based on estimated values is shown
in Table 8. Of 179 subjects in Group D̄, the Disease Status = "No" group, 100 (56%) were
correctly classified by the discriminant into Group PD̄ and 79 (44%) were incorrectly
classified into Group PD. Of 262 subjects in Group D, the Disease Status = "Yes" group, 188
(72%) were correctly classified into Group PD and 74 (28%) were incorrectly classified into
Group PD̄. Overall, of 441 subjects, 288 subjects (65%) were correctly classified and 35%
were misclassified.

Table 9 displays an evaluation of the subject linear discriminant function based on predicted
values using the minimum random subject effect. Table 9 is similar to Table 8. Prediction
discrimination led to a slight improvement of discrimination in Group D̄, but slightly worse
results in Group D. Overall, the error rate was approximately the same.
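The percentages quoted for Tables 8 and 9 follow directly from the four cell counts of each table. A short sketch of that arithmetic is given below; the function name and the dictionary keys ("sensitivity", "specificity", and so on) are conventional labels chosen here for illustration and are not terms used in the source.

```python
def classification_rates(n11, n12, n21, n22):
    """Summarize a 2 x 2 table of actual group (rows) by classified group (columns).

    Row 1 is the Disease Status "No" group and row 2 the "Yes" group;
    column 1 is "classified No" and column 2 is "classified Yes".
    """
    total = n11 + n12 + n21 + n22
    return {
        "specificity": n11 / (n11 + n12),          # "No" subjects classified "No"
        "false_positive_rate": n12 / (n11 + n12),  # "No" subjects classified "Yes"
        "false_negative_rate": n21 / (n21 + n22),  # "Yes" subjects classified "No"
        "sensitivity": n22 / (n21 + n22),          # "Yes" subjects classified "Yes"
        "misclassification_rate": (n12 + n21) / total,
    }


# Cell counts as reported in Table 8 (estimated values) and Table 9 (predicted values).
table8 = classification_rates(100, 79, 74, 188)
table9 = classification_rates(105, 74, 81, 181)

for name, rates in (("Table 8", table8), ("Table 9", table9)):
    summary = ", ".join(f"{key}={value:.0%}" for key, value in rates.items())
    print(f"{name}: {summary}")
```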



rates than are likely to occur in practice, because the training sample was used both to derive
the discriminant function and to evaluate it. Evaluation of the discriminant function using the
evaluation sample will produce unbiased estimates of the misclassification rates. Resampling
techniques such as jackknifing or bootstrapping can produce less biased estimates while still
using data from the training sample.
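The difference between evaluating a rule on its own training data and evaluating it by resampling can be illustrated with a leave-one-out (jackknife-style) calculation. The sketch below is only an illustration under stated assumptions: it uses a one-dimensional score and a simple midpoint-of-means rule in place of the mixed-model discriminant, and the scores are hypothetical values invented for the example.

```python
from statistics import mean


def midpoint_rule(scores_a, scores_b):
    """Build a classifier that splits groups "a" and "b" at the midpoint of their means."""
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    cut = (mean_a + mean_b) / 2.0
    a_is_low = mean_a < mean_b

    def classify(x):
        return "a" if (x < cut) == a_is_low else "b"

    return classify


def resubstitution_error(scores_a, scores_b):
    """Error rate when the rule is evaluated on the data used to derive it."""
    classify = midpoint_rule(scores_a, scores_b)
    wrong = sum(classify(x) != "a" for x in scores_a)
    wrong += sum(classify(x) != "b" for x in scores_b)
    return wrong / (len(scores_a) + len(scores_b))


def leave_one_out_error(scores_a, scores_b):
    """Error rate when each subject is classified by a rule derived without that subject."""
    wrong = 0
    for i, x in enumerate(scores_a):
        classify = midpoint_rule(scores_a[:i] + scores_a[i + 1:], scores_b)
        wrong += classify(x) != "a"
    for i, x in enumerate(scores_b):
        classify = midpoint_rule(scores_a, scores_b[:i] + scores_b[i + 1:])
        wrong += classify(x) != "b"
    return wrong / (len(scores_a) + len(scores_b))


# Hypothetical scores, not data from the sickle cell example.
group_a = [-3.0, -1.0, -0.9, 0.1]
group_b = [0.2, 1.0, 1.2, 3.0]

resub = resubstitution_error(group_a, group_b)
loo = leave_one_out_error(group_a, group_b)
print(f"resubstitution error = {resub:.3f}, leave-one-out error = {loo:.3f}")
```

In this toy data the leave-one-out error is larger than the resubstitution error, which is the direction of bias the text describes; with other data the two can coincide.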



Step 2: Use two types of logistic regression to compute estimates of the probability that a
new subject will belong to each group.

Two types of logistic regressions are fitted to the training sample data for each of the
discriminant functions. In both logistic regressions, the Disease Status indicator is the
dependent ("Y") variable. In the first logistic regression, the value of the discriminant
functions based on estimation is used as an independent ("X") variable. In the second
logistic regression, the value of the discriminant functions based on prediction is used as an
independent ("X") variable. In a third logistic regression, the biomarkers used in the
discriminant function are incorporated as independent ("X") variables, along with covariates
used in the fixed effects part of the mixed model, and the Disease Status indicator is the
dependent ("Y") variable. The estimates from the logistic regression models are used to
compute, for each subject, an estimated probability that the subject belongs to the diseased
(Disease Status "Yes") group. The results of the logistic regression computations are not
displayed in tables.

Figure 1 displays the empirical distribution functions ("EDF") of the linear discriminant
function values (based on estimated values) for Group D (solid line) and Group D̄ (dashed
line). To prepare the graph, the data for the subjects are sorted by Disease Status group and,
within a group, by increasing values of D(Y). Data points are plotted in that sequence. The
EDF value starts at 0 (before the first subject's data are plotted) and increases by 1/n for each
subject, where n is the number of subjects in that group. Thus, the EDF climbs from 0 to 1,
separately for each group. In Figure 1, the fact that the EDF for Group D is shifted to the left
of the EDF for Group D̄ indicates that Group D tends to have lower scores than Group D̄.
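The EDF construction described above (sort the scores within a group, then step up by 1/n at each subject) can be sketched in a few lines. The function names and the scores are hypothetical and are used only to show the mechanics; they are not values from Figure 1.

```python
def empirical_distribution(scores):
    """Return (sorted_scores, edf_values); the EDF rises by 1/n at each sorted score."""
    ordered = sorted(scores)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def fraction_below(scores, cutoff):
    """Share of scores strictly less than `cutoff` (the EDF just to the left of it)."""
    return sum(s < cutoff for s in scores) / len(scores)


# Hypothetical discriminant scores for one group.
example_scores = [0.5, -1.0, 2.0, -0.2]
xs, edf = empirical_distribution(example_scores)
share_left_of_zero = fraction_below(example_scores, 0.0)
```

Plotting `edf` against `xs` as a step function, once per group, gives a graph of the kind the text describes, and `fraction_below(scores, 0.0)` reads off the share of a group lying to the left of the separation point.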
One can see that roughly 72% of Group D subjects have D(Y) values less than 0 (the
separation point between Group PD and Group PD̄), while Group D̄ has about 44% of their
subjects' EDF values to the left of 0. The steepness of the groups' EDF lines near the vertical
line at LDF=0 indicates that many subjects are "borderline" and are difficult to classify. It is
possible that if an additional year of followup had been available, a number of subjects in
Group D̄ (in these data) would have had pain crises in the subsequent year and would have
"converted" to Group D.
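The logistic regressions of Step 2 turn a discriminant score into an estimated probability of belonging to the diseased group, which gives borderline subjects near LDF=0 an intermediate probability rather than a hard classification. The sketch below fits a one-predictor logistic regression by Newton-Raphson in plain Python. It is an assumption-laden illustration: the scores and status values are hypothetical, the function names are invented here, and the source does not state which fitting algorithm or software was used.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predicted_probability(score, b0, b1):
    """Estimated probability of Disease Status "Yes" for a given discriminant score."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * score)))


# Hypothetical discriminant scores; status 1 means Disease Status "Yes".
# Lower scores are made more common in the "Yes" group, as in Figure 1.
scores = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.4, 0.2, 0.8, 1.5, 2.0]
status = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

b0, b1 = fit_logistic(scores, status)
p_low = predicted_probability(-2.0, b0, b1)
p_high = predicted_probability(2.0, b0, b1)
```

The two groups overlap in these hypothetical scores, which keeps the maximum likelihood estimates finite; with perfectly separated groups the iteration would not settle.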



The empirical distribution functions ("EDF") of the minimum random subject linear
discriminant function values for Group D (solid line) and Group D̄ (dashed line) are shown in
Figure 2. The results and interpretations are similar to those in Figure 1. However, the
groups' EDF lines are even steeper, in the vicinity of LDF=0, in Figure 2 than in Figure 1,
emphasizing the fact that many subjects are borderline.



These Figures reveal, as do the statistics above, that the discriminant procedures effectively
classify subjects who ultimately must be hospitalized for a pain crisis but, for the limited
data available in this example, the procedures are less effective for the subgroup who will not
be so hospitalized.
Variable Name    Description

AGEYR            Age of patient (years)
ALBUMIN          Albumin (g/dL)
ALKPHOS          Alkaline Phosphatase (U/L)
BMI              Body Mass Index (Wt./Ht.^2)
BP_DIAST         Diastolic Blood Pressure (mm Hg)
BP_SYST          Systolic Blood Pressure (mm Hg)
CALCIUM          Calcium (g/dL)
CL               Chloride (meq/L)
CO2              Carbon Dioxide (mmol/L)
GENDER           Gender of patient (M/F)
HBA2             Hemoglobin A2 (%)
HCT              Hematocrit (%)
HEIGHT           Height (cm)
HGB              Hemoglobin (g/dL)
K                Potassium (mmol/L)
L_ALKPH          Log10 of Alkaline Phosphatase
L_ALT            Log10 of Alanine Transaminase
L_AST            Log10 of Aspartate Transaminase
L_BUN            Log10 of Blood Urea Nitrogen
L_CR             Log10 of Creatinine
L_DBILI          Log10 of Direct Bilirubin
L_HBF            Log10 of Hemoglobin F
L_LDH            Log10 of Lactic Dehydrogenase
L_TBILI          Log10 of Total Bilirubin
L_URICA          Log10 of Uric Acid
MCH              Mean Corpuscular Hemoglobin (mg/dL)
MCHC             Mean Corpuscular Hemoglobin Concentration (g/dL)
MCV              Mean Corpuscular Volume (fl)
NA               Sodium (meq/L)
PHOSPHOR         Phosphorus (mg/dL)
PLATELET         Platelet Count (x 10^9/L)
RBC              Red Blood Cell Count (x 10^9/L)
RETIC            Reticulocyte Count (%)
TOTPROT          Total Blood Protein (g/L)
WBC              White Blood Cell Count (x 10^9/L)
WEIGHT           Weight of patient (kg)
Table 3. Summary of Phase I, Steps 5-7, the Selection of Candidate Biomarkers from a List of
Potential Biomarkers using Correlation, Logistic Regression, and Mixed Models.

[Table body not recoverable. Surviving column headings: Correlation w/ Platelets; Logistic Regression.]
Table 5. Estimates of Fixed Effects Coefficients and Related Statistics

Table 8. Evaluation of the Discriminant Procedure Using Estimated Values

Numbers of subjects in the validation sample tabulated by actual and classified
membership in D.

                          Subject was classified as a member of Group:
Subject was actually
a member of Group:        PD̄ (No)                    PD (Yes)

D̄ (No)                    N_11 = 100                  N_12 = 79
                          r_11 = 56%                  r_12 = r_FP = 44%

D (Yes)                   N_21 = 74                   N_22 = 188
                          r_21 = r_FN = 28%           r_22 = 72%

r_m = 153/441 = 35%
PCT/US98/02433

Table 9. Evaluation of the Discriminant Procedure Using Predicted Values

Numbers of subjects in the validation sample tabulated by actual and classified membership
in D.

                          Subject was classified as a member of Group:
Subject was actually
a member of Group:        PD̄ (No)                    PD (Yes)

D̄ (No)                    N_11 = 105                  N_12 = 74
                          r_11 = 59%                  r_12 = r_FP = 41%

D (Yes)                   N_21 = 81                   N_22 = 181
                          r_21 = r_FN = 31%           r_22 = 69%

r_m = 155/441 = 35%
What Is Claimed Is:

1. A computer-based system for predicting future health of individuals comprising:

(a) a computer comprising a processor containing a database of longitudinally-acquired
biomarker values from individual members of a test population, subpopulation D of said
members being identified as having acquired a specified biological condition within a specified
time period or age interval and a subpopulation D̄ being identified as not having acquired the
specified biological condition within the specified time period or age interval; and

(b) a computer program that includes steps for:

(1) selecting from said biomarkers a subset of biomarkers for discriminating
between members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is
selected based on distributions of the biomarker values of the individual members of the test
population; and

(2) using the distributions of the selected biomarkers to develop a statistical
procedure that is capable of being used for:

(i) classifying members of the test population as belonging within a
subpopulation PD having a prescribed high probability of acquiring the specified biological
condition within the specified time period or age interval or as belonging within a subpopulation
PD̄ having a prescribed low probability of acquiring the specified biological condition within the
specified time period or age interval; or

(ii) estimating quantitatively, for each member of the test population, the
probability of acquiring the specified biological condition within the specified time period or age
interval.

2. The computer-based system of claim 1 wherein the statistical procedure comprises a
discriminant function utilizing the estimated mean vectors and estimated covariance matrices of
the distributions of biomarker values within the subpopulations D and D̄.

3. The computer-based system of claim 2 wherein estimates of parameters of the distributions of
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker
data from the test population.
4. The computer-based system of claim 2 wherein:

(a) the estimated mean vectors are modeled as vector-valued functions of expected-value
parameters or values of covariates; or

(b) estimated covariance matrices are modeled as matrix-valued functions of covariance
parameters or values of covariates.

5. The computer-based system of claim 4 wherein estimates of parameters of the distributions of
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker
data from the test population.

6. The computer-based system of claim 4 wherein an estimated mean vector or probability
incorporates an estimate of the realized value of a random subject effect vector for a member
being classified or of a member for whom a probability is estimated.



7. The computer-based system of claim 6 wherein estimates of parameters of the distributions of
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker
data from the test population.



8. A computer-based system for predicting future health of individuals comprising:

(a) a computer comprising a processor containing a database of biomarker values from
individual members of a test population, subpopulation D of said members being identified as
having acquired a specified biological condition within a specified time period or age interval
and a subpopulation D̄ being identified as not having acquired the specified biological condition
within the specified time period or age interval; and

(b) a computer program that includes steps for:

(1) selecting from said biomarkers a subset of biomarkers for discriminating
between members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is
selected based on distributions of the biomarker values of the individual members of the test
population; and

(2) using the distributions of the selected biomarkers to develop a statistical
procedure that is capable of being used for:

(i) classifying members of the test population as belonging within a
subpopulation PD having a prescribed high probability of acquiring the specified biological
condition within the specified time period or age interval or as belonging within a subpopulation
PD̄ having a prescribed low probability of acquiring the specified biological condition within the
specified time period or age interval; or

(ii) estimating quantitatively, for each member of the test population, the
probability of acquiring the specified biological condition within the specified time period or age
interval;

wherein the statistical procedure comprises a discriminant function utilizing the estimated
mean vectors and estimated covariance matrices of the distributions of biomarker values within
the subpopulations D and D̄.



9. The computer-based system of claim 8 wherein estimates of parameters of the distributions of
the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker
data from the test population.



10. The computer-based system of claim 9 wherein:

(a) the estimated mean vectors are modeled as vector-valued functions of expected-value
parameters or values of covariates; or

(b) estimated covariance matrices are modeled as matrix-valued functions of covariance
parameters or values of covariates.

11. The computer-based system of claim 10 wherein an estimated mean vector or probability
incorporates an estimate of the realized value of a random subject effect vector for a member
being classified or of a member for whom a probability is estimated.
12. A method of predicting an individual's health comprising:

collecting a plurality of biomarker values from an individual, wherein at least one of said
biomarker values is obtained by physically measuring the biomarker value; and

applying a statistical procedure to said plurality of biomarker values so as:

(i) to classify said individual as having a prescribed high probability of acquiring
a specified biological condition within a specified time period or age interval or as having a
prescribed low probability of acquiring the specified biological condition within the specified
time period or age interval; or

(ii) to estimate quantitatively for said individual the probability of acquiring the
specified biological condition within the specified time period or age interval;

wherein said statistical procedure is based on:

(1) collecting a database of longitudinally-acquired biomarker values from individual
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within
the specified time period or age interval;

(2) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected
based on distributions of the biomarker values of the individual members of the test population;
and

(3) using the distributions of the selected biomarkers to develop said statistical
procedure.



13. The method according to claim 12 wherein at least one of said biomarker values is obtained
from a biological sample.

14. The method according to claim 13 wherein said biological sample is a serum or urine
sample.
15. A computer-based system for predicting an individual's future health comprising:

(a) a computer comprising a processor containing a plurality of biomarker values from
an individual; and

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as:

(i) to classify said individual as having a prescribed high probability of acquiring
a specified biological condition within a specified time period or age interval or as having a
prescribed low probability of acquiring the specified biological condition within the specified
time period or age interval; or

(ii) to estimate quantitatively for said individual the probability of acquiring the
specified biological condition within the specified time period or age interval;

wherein said statistical procedure is based on:

(1) collecting a database of longitudinally-acquired biomarker values from individual
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within
the specified time period or age interval;

(2) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected
based on distributions of the biomarker values of the individual members of the test population;
and

(3) using the distributions of the selected biomarkers to develop said statistical
procedure.



16. The computer-based system of claim 15 wherein the plurality of biomarker values from said
individual includes longitudinally-acquired biomarker values.

17. The computer-based system of claim 15 wherein the specified biological condition is death
due to a specified underlying cause of death within the specified time period or age interval.
18. The computer-based system of claim 15 wherein the specified biological condition is a
specified morbidity within the specified time period or age interval.



19. The computer-based system of claim 15 wherein the specified time period is a period of at
least two years.

20. The computer-based system of claim 15 wherein the specified time period is a period of at
least three years.



21. A method for assessing an individual's future risk of death from specified underlying causes
of death comprising:

collecting a plurality of biomarker values from an individual, wherein at least one of said
biomarker values is obtained by physically measuring the biomarker value; and

applying a statistical procedure to said plurality of biomarker values so as to determine
whether said individual is classified as having a prescribed high probability of dying, within a
specified time period or age interval, from any one of the underlying causes of death that account
in the aggregate for at least 60% of all deaths in a test population over the specified time period
or age interval.

22. A method for assessing an individual's evidence of good health comprising:

collecting a plurality of biomarker values from an individual, wherein at least one of said
biomarker values is obtained by physically measuring the biomarker value; and

applying a statistical procedure to said plurality of biomarker values so as to determine
whether said individual is classified as having a prescribed high probability of not dying, within
a specified time period or age interval, from any one of the underlying causes of death that
account in the aggregate for at least 60% of all deaths in a test population over the specified time
period or age interval.



23. A computer-based system for assessing an individual's future risk of death from a specified
underlying cause of death comprising:

(a) a computer comprising a processor containing a plurality of biomarker values from an
individual; and

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as to determine whether said individual is classified as having a
prescribed high probability of dying, within a specified time period or age interval, from any one
of the underlying causes of death that account in the aggregate for at least 60% of all deaths in a
test population over the specified time period or age interval.



24. A computer-based system for assessing an individual's evidence of good health comprising:

(a) a computer comprising a processor containing a plurality of biomarker values from an
individual; and

(b) a computer program that includes steps for applying a statistical procedure to said
plurality of biomarker values so as to determine whether said individual is classified as having a
prescribed high probability of not dying, within a specified time period or age interval, from any
one of the underlying causes of death that account in the aggregate for at least 60% of all deaths
in a test population over the specified time period or age interval.

25. An apparatus for assessing an individual's risk of future health problems comprising:

(a) a storage device for storing a plurality of biomarker values from an individual; and

(b) a processor coupled to the storage device and programmed:

1) to receive from the storage device said plurality of biomarker values; and

2) to apply a statistical procedure to said plurality of biomarker values so as:

(i) to classify said individual as belonging within a subpopulation PD
having a prescribed high probability of acquiring a specified biological condition within a
specified time period or age interval or as belonging within a subpopulation PD̄ having a
prescribed low probability of acquiring the specified biological condition within the specified
time period or age interval; or

(ii) to estimate quantitatively the probability for said individual acquiring
the specified biological condition within the specified time period or age interval;

wherein said statistical procedure is based on:

(1) collecting a database of longitudinally-acquired biomarker values from individual
members of a test population, subpopulation D of said members being identified as having
acquired the specified biological condition within the specified time period or age interval and a
subpopulation D̄ being identified as not having acquired the specified biological condition within
the specified time period or age interval;

(2) selecting from said biomarkers a subset of biomarkers for discriminating between
members belonging to the subpopulations D and D̄, wherein the subset of biomarkers is selected
based on distributions of the biomarker values of the individual members of the test population;
and

(3) using the distributions of the selected biomarkers to develop said statistical
procedure.